SeqUtils

Library for sequence processing

prokbert.sequtils.load_contigs(fasta_files_list)

Loads contigs from a list of FASTA files.

prokbert.sequtils.segment_sequence_contiguous(...)

Splits a sequence into end-to-end, non-overlapping segments.
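
The idea of contiguous segmentation can be sketched in plain Python. This is an illustration only, not the prokbert implementation: the helper name `contiguous_segments`, its parameters, and the dictionary keys are assumptions made for the example.

```python
def contiguous_segments(sequence, segment_length, min_length=1):
    """Split `sequence` into back-to-back, non-overlapping windows.

    Hypothetical helper for illustration; the names and the returned
    dictionary keys are assumptions, not the library's API.
    """
    segments = []
    for start in range(0, len(sequence), segment_length):
        piece = sequence[start:start + segment_length]
        # The final window may be shorter than segment_length;
        # keep it only if it reaches min_length.
        if len(piece) >= min_length:
            segments.append(
                {"segment": piece, "start": start, "end": start + len(piece)}
            )
    return segments


segments = contiguous_segments("ATGCGTACGTTAG", segment_length=5, min_length=3)
for seg in segments:
    print(seg["start"], seg["end"], seg["segment"])
```

Each window starts where the previous one ends, so every position of the input is covered at most once.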

prokbert.sequtils.segment_sequences_random(...)

Randomly segments the input sequences.
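
As a rough sketch of random segmentation, the snippet below draws fixed-length windows at uniformly random start positions. The helper name `random_segments`, its parameters, and the seeding scheme are assumptions for illustration; the library function may sample differently.

```python
import random


def random_segments(sequence, segment_length, num_segments, seed=0):
    """Draw `num_segments` windows of `segment_length` at random starts.

    Hypothetical helper for illustration only. Windows may overlap,
    because each start position is drawn independently.
    """
    last_start = len(sequence) - segment_length
    if last_start < 0:
        return []  # the sequence is too short to hold one full window
    rng = random.Random(seed)  # local generator, so results are reproducible
    segments = []
    for _ in range(num_segments):
        start = rng.randint(0, last_start)  # both bounds are inclusive
        segments.append((start, sequence[start:start + segment_length]))
    return segments


sampled = random_segments("ATGCGTACGTTAGCCA", segment_length=5, num_segments=4)
```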

prokbert.sequtils.segment_sequences(...[, ...])

Segments sequences based on the provided parameters.

prokbert.sequtils.lca_tokenize_segment(...)

Tokenizes a single segment using Local Context Aware (LCA) tokenization.
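
A minimal sketch of the underlying idea, assuming that tokenization here means sliding a window of length k over the segment with a fixed shift and mapping each k-mer to an integer ID. The helper `kmerize`, the toy vocabulary, and the ID assignment are assumptions for illustration; the real tokenizer's vocabulary and special tokens are not reproduced.

```python
from itertools import product


def kmerize(segment, k, shift):
    """Return the k-mers found at positions 0, shift, 2*shift, ... of `segment`."""
    return [segment[i:i + k] for i in range(0, len(segment) - k + 1, shift)]


# Toy vocabulary: every 3-mer over ACGT gets a consecutive integer ID.
# (Assumption: a real vocabulary would also reserve IDs for special tokens.)
vocab = {"".join(letters): idx for idx, letters in enumerate(product("ACGT", repeat=3))}

kmers = kmerize("ATGCGT", k=3, shift=1)
token_ids = [vocab[kmer] for kmer in kmers]
print(kmers)
print(token_ids)
```

With a shift of 1, neighbouring tokens share k - 1 characters, so each token carries part of its local context; a larger shift reduces the overlap and shortens the token sequence.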

prokbert.sequtils.tokenize_kmerized_segment_list(...)

Tokenizes or vectorizes a list of k-merized segments into a list of token vectors.

prokbert.sequtils.process_batch_tokenize_segments_with_ids(...)

Tokenizes a batch of segments and associates them with their provided IDs.

prokbert.sequtils.batch_tokenize_segments_with_ids(...)

Parallel tokenization of segments with associated IDs.

prokbert.sequtils.get_rectangular_array_from_tokenized_dataset(...)

Creates a rectangular NumPy array from tokenized segment data, suitable as input to a language model (LM).
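
Tokenized segments usually differ in length, so building a rectangular array means padding or truncating each one to a common width. The sketch below shows that step under stated assumptions: the helper name `to_rectangular`, the padding ID of 0, and right-side padding are illustrative choices, not the library's documented behaviour.

```python
import numpy as np


def to_rectangular(token_lists, max_len, pad_id=0):
    """Pack variable-length token lists into a (n_segments, max_len) array.

    Hypothetical helper for illustration: shorter rows are right-padded
    with `pad_id`, longer rows are truncated to `max_len`.
    """
    array = np.full((len(token_lists), max_len), pad_id, dtype=np.int64)
    for row, tokens in enumerate(token_lists):
        clipped = tokens[:max_len]
        array[row, :len(clipped)] = clipped
    return array


X = to_rectangular([[2, 14, 57, 3], [2, 38, 3], [2, 27, 9, 11, 40, 3]], max_len=5)
print(X.shape)
```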

prokbert.sequtils.pretty_print_overlapping_sequence(...)

Formats a sequence for pretty printing with overlapping k-mers.

prokbert.sequtils.generate_kmers(abc, k)

Generates all possible k-mers from a given alphabet.
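
Enumerating every k-mer over an alphabet is a Cartesian product of the alphabet with itself k times. The sketch below uses `itertools.product` from the standard library; the helper name `all_kmers` and the output order are assumptions, and the library function may order its output differently.

```python
from itertools import product


def all_kmers(alphabet, k):
    """Return every string of length `k` over `alphabet`, in product order."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]


dinucleotides = all_kmers("ACGT", 2)
print(len(dinucleotides))  # len(alphabet) ** k entries
```

The result grows as `len(alphabet) ** k`, so a four-letter alphabet with k = 6 already yields 4096 k-mers.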

prokbert.sequtils.save_to_hdf(X, hdf_file_path)

Saves a NumPy array and an optional pandas DataFrame to an HDF5 file.