ProkBERT Tokenizer

This module implements the ProkBERT tokenizer. The ProkBERTTokenizer class handles the tokenization steps specific to ProkBERT, including Local Context Aware (LCA) tokenization and sequence segmentation.

ProkBERTTokenizer Class and Methods

class prokbert.prokbert_tokenizer.ProkBERTTokenizer(tokenization_params: Dict = {}, segmentation_params: Dict = {}, comp_params: Dict = {}, operation_space: str = 'sequence', **kwargs)

Bases: PythonBackend

Custom tokenizer for ProkBERT. It implements Local Context Aware (LCA) tokenization, which represents a sequence as overlapping k-mers so that each token carries local context; this is intended to improve model generalization and performance. The key parameters are the k-mer size and the shift, i.e. the number of nucleotides between the start positions of consecutive k-mers. For instance, a k-mer size of 6 with a shift of 1 produces heavily overlapping 6-mers that capture detailed sequence information, while a k-mer size of 1 reduces to basic character-level tokenization.
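The overlapping k-mer idea can be sketched in plain Python. The helper name `kmer_windows` is hypothetical and not part of the ProkBERT API; the sketch assumes that the shift is the distance between the start positions of consecutive k-mers and that trailing characters that do not fill a whole k-mer are dropped.

```python
from typing import List


def kmer_windows(sequence: str, kmer: int = 6, shift: int = 1) -> List[str]:
    """Split a sequence into k-mers whose start positions advance by `shift`.

    Illustrative only: this is not the library's implementation.
    """
    if kmer < 1 or shift < 1:
        raise ValueError("kmer and shift must be positive integers")
    last_start = len(sequence) - kmer  # negative when the sequence is too short
    return [sequence[i:i + kmer] for i in range(0, last_start + 1, shift)]


print(kmer_windows("ATTCTTT", kmer=6, shift=1))  # ['ATTCTT', 'TTCTTT']
print(kmer_windows("ATTC", kmer=1, shift=1))     # ['A', 'T', 'T', 'C']
```

With a k-mer size of 6 and a shift of 1, neighbouring tokens share five nucleotides; with a k-mer size of 1 every token is a single character.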

Parameters
  • tokenization_params (dict) – Parameters for tokenization, derived from the ‘tokenization’ part of the config. Expected keys include ‘type’, ‘kmer’, ‘shift’, etc. See below for detailed descriptions.

  • segmentation_params (dict) – Parameters for segmentation, derived from the ‘segmentation’ part of the config. Includes ‘type’, ‘min_length’, ‘max_length’, etc.

  • comp_params (dict) – Computation parameters from the ‘computation’ part of the config, like CPU cores and batch sizes.

  • operation_space (str) – Defines the operation space (‘sequence’ or ‘kmer’).

Tokenization Parameters:
  • type (str): Tokenization approach, default ‘lca’ for Local Context Aware.

  • kmer (int): k-mer size for tokenization.

  • shift (int): Shift between the start positions of consecutive k-mers, in nucleotides.

  • max_segment_length (int): Maximum number of characters in a segment.

  • token_limit (int): Maximum token count for language model processing.

  • max_unknown_token_proportion (float): Maximum allowed proportion of unknown tokens.

  • vocabfile (str): Path to the vocabulary file.

  • isPaddingToMaxLength (bool): Whether to pad sentences to a fixed length.

  • add_special_token (bool): Whether to add special tokens like [CLS], [SEP].
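The segment length, k-mer size, and shift together determine how many tokens a segment yields, which is what links max_segment_length to token_limit. The following is a rough sketch of that arithmetic; `expected_token_count` is a hypothetical helper, and it assumes that exactly one [CLS] and one [SEP] token are added and that incomplete trailing k-mers are dropped.

```python
def expected_token_count(
    segment_length: int,
    kmer: int,
    shift: int,
    lca_shift: int = 0,
    add_special_tokens: bool = True,
) -> int:
    """Estimate the number of tokens produced for a segment.

    Assumption (not confirmed by the library docs): k-mers start at
    `lca_shift` and advance by `shift`; two special tokens are added.
    """
    usable = segment_length - lca_shift - kmer
    n_kmers = 0 if usable < 0 else usable // shift + 1
    return n_kmers + (2 if add_special_tokens else 0)


# A 21-nucleotide segment with 6-mers
print(expected_token_count(21, kmer=6, shift=1))  # 18
print(expected_token_count(21, kmer=6, shift=2))  # 10
```

Under these assumptions, doubling the shift roughly halves the token count, so a fixed token limit covers about twice as many nucleotides.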

Segmentation Parameters:
  • type (str): Segmentation type, ‘contiguous’ or ‘random’.

  • min_length (int): Minimum length for a segment.

  • max_length (int): Maximum length for a segment.

  • coverage (float): Expected average coverage of positions in the sequence.
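As an illustration of how min_length and max_length might interact for the ‘contiguous’ type, the sketch below cuts a sequence into consecutive, non-overlapping chunks. This is one plausible reading of the parameters, not the library's segmentation code; `contiguous_segments` is a hypothetical name, and the handling of a short final chunk (dropped here) is an assumption.

```python
from typing import List


def contiguous_segments(sequence: str, min_length: int, max_length: int) -> List[str]:
    """Cut a sequence into consecutive chunks of at most `max_length`.

    Assumption: chunks shorter than `min_length` (typically the tail)
    are discarded. The real implementation may behave differently.
    """
    if max_length < 1:
        raise ValueError("max_length must be a positive integer")
    segments = []
    for start in range(0, len(sequence), max_length):
        chunk = sequence[start:start + max_length]
        if len(chunk) >= min_length:
            segments.append(chunk)
    return segments


print([len(s) for s in contiguous_segments("ACGT" * 6, min_length=5, max_length=10)])
# [10, 10]  (the 4-nucleotide tail is below min_length)
```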

Computation Parameters:
  • cpu_cores_for_segmentation (int): Number of CPU cores for segmentation.

  • cpu_cores_for_tokenization (int): Number of CPU cores for tokenization.

  • batch_size_tokenization (int): Batch size for tokenization.

  • batch_size_fasta_segmentation (int): Batch size for fasta file processing.

  • numpy_token_integer_prec_byte (int): Integer precision, in bytes, of the vectorized token arrays.

  • np_tokentype (type): Data type for numpy token arrays.

Usage Example:
>>> tokenization_parameters = {'kmer': 6, 'shift': 1}
>>> tokenizer = ProkBERTTokenizer(tokenization_params=tokenization_parameters)
>>> encoded = tokenizer('ATTCTTT')
>>> print(encoded)

batch_decode(token_ids_list: List[List[int]], **kwargs) -> List[str]

Decodes multiple token ID sequences back into their original DNA sequences.

This method converts each list of token IDs in the batch back to its corresponding sequence.

Parameters

token_ids_list (List[List[int]]) – A list of token ID sequences to be decoded. Each element in the list is a list of token IDs.

Returns

A list containing the decoded DNA sequences.

Return type

List[str]

Usage Example:
>>> tokenizer = ProkBERTTokenizer(...)
>>> ids = [[2, 213, 3343, 165, 2580, 248, 3905, 978, 3296, 3]]
>>> sequences = tokenizer.batch_decode(ids)
>>> print(sequences)
...

batch_encode_plus(batch_text_or_text_pairs: List[str], lca_shift: int = 0, all: bool = False, **kwargs) -> Dict[str, List[List[int]]]

Tokenizes multiple sequences and returns them in a format suitable for model input. Assumes that sequences have already been preprocessed (segmented) and quality controlled (i.e. all uppercase).

Parameters
  • batch_text_or_text_pairs (List[str]) – A list of DNA sequences to be tokenized.

  • lca_shift (int, optional) – Specifies the LCA offset for tokenization. If the required offset is >= shift, an error is raised. Defaults to 0.

  • all (bool, optional) – If True, returns all possible tokenization vectors. If False, only the tokenization corresponding to the specified offset is used. Defaults to False.

  • return_tensors (str, optional) – Optional argument to specify the return type (numpy ndarray or PyTorch tensor).

Returns

A dictionary containing ‘input_ids’, ‘token_type_ids’, and ‘attention_mask’ as lists of lists. If ‘return_tensors’ is specified as ‘pt’, these are returned as PyTorch tensors.

Return type

Dict[str, List[List[int]]]

Usage Example:
>>> tokenizer = ProkBERTTokenizer()
>>> segments = ['AATCAAGGA', 'ATTATTATCGTT']
>>> encoded = tokenizer.batch_encode_plus(segments)
>>> print(encoded['input_ids'])
...
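The shape of the returned dictionary can be illustrated without the library. The sketch below pads already-encoded ID lists to a common length and builds the matching masks; `pad_batch` is a hypothetical helper, and the padding ID of 0 and the all-zero token type IDs are assumptions for illustration rather than ProkBERT's actual values.

```python
from typing import Dict, List


def pad_batch(batch_ids: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad ID lists to the longest one and build the matching masks.

    Assumptions: `pad_id` is 0 and every token belongs to segment 0.
    """
    max_len = max((len(ids) for ids in batch_ids), default=0)
    input_ids, attention_mask, token_type_ids = [], [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token
        token_type_ids.append([0] * max_len)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


batch = pad_batch([[2, 11, 12, 3], [2, 11, 3]])
print(batch["input_ids"])       # [[2, 11, 12, 3], [2, 11, 3, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```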

convert_ids_to_tokens(ids: Union[int, List[int], Tensor]) -> Union[str, List[str]]

Converts token IDs back to their original tokens. This function can handle a single ID or a list of IDs. It also supports handling IDs provided as a PyTorch tensor.

Parameters

ids (Union[int, List[int], torch.Tensor]) – A single token ID or a list of token IDs. Can also be a PyTorch tensor of token IDs.

Returns

The corresponding token or list of tokens. If ids is a single integer or a tensor with a single value, a single token string is returned. If ids is a list or tensor with multiple values, a list of token strings is returned.

Return type

Union[str, List[str]]

Usage Example:
>>> tokenizer = ProkBERTTokenizer()
>>> ids = [213, 3343]
>>> tokens = tokenizer.convert_ids_to_tokens(ids)
>>> print(tokens)
...
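The single-value versus list behaviour described above amounts to a lookup in the inverted vocabulary. A minimal sketch follows, using a made-up toy vocabulary; `ids_to_tokens` is a hypothetical helper, the IDs shown are not ProkBERT's real IDs, and tensor input is left out to keep the example free of a PyTorch dependency.

```python
from typing import Dict, List, Union


def ids_to_tokens(
    ids: Union[int, List[int]],
    id_to_token: Dict[int, str],
    unk_token: str = "[UNK]",
) -> Union[str, List[str]]:
    """Return one token for a single ID, or a list of tokens for a list of IDs."""
    if isinstance(ids, int):
        return id_to_token.get(ids, unk_token)
    return [id_to_token.get(int(i), unk_token) for i in ids]


# Toy vocabulary, invented for this example only
toy_vocab = {"[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "AA": 5, "AC": 6}
id_to_token = {index: token for token, index in toy_vocab.items()}

print(ids_to_tokens(5, id_to_token))          # AA
print(ids_to_tokens([2, 5, 6, 3], id_to_token))  # ['[CLS]', 'AA', 'AC', '[SEP]']
```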

decode(ids: Union[List[int], Tensor]) -> str

Decodes a list of token IDs back to the original DNA sequence.

This method converts token IDs back to their corresponding tokens and then concatenates them to form the original sequence. It is capable of handling token IDs provided as a list or a PyTorch tensor.

Parameters

ids (Union[List[int], torch.Tensor]) – Token IDs to be decoded. Can be a list of integers or a PyTorch tensor.

Returns

The decoded DNA sequence as a string.

Return type

str

Usage Example:
>>> tokenizer = ProkBERTTokenizer(...)
>>> ids = [213, 3343]
>>> sequence = tokenizer.decode(ids)
>>> print(sequence)
...
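Because LCA tokens overlap, joining them back into a sequence cannot be a plain string concatenation. One way to do it, sketched below, keeps the first k-mer whole and appends only the last `shift` characters of each following k-mer. The helper `join_overlapping_kmers` is hypothetical; it assumes special tokens have already been removed, that the shift does not exceed the k-mer size, and that the tokens come from a single LCA offset.

```python
from typing import List


def join_overlapping_kmers(tokens: List[str], shift: int = 1) -> str:
    """Rebuild a sequence from overlapping k-mers taken `shift` apart.

    Illustrative only; nucleotides that were never covered by a full
    k-mer during tokenization cannot be recovered.
    """
    if shift < 1:
        raise ValueError("shift must be a positive integer")
    if not tokens:
        return ""
    sequence = tokens[0]
    for token in tokens[1:]:
        sequence += token[-shift:]  # only the non-overlapping suffix is new
    return sequence


print(join_overlapping_kmers(["AATCAA", "ATCAAG", "TCAAGG"], shift=1))  # AATCAAGG
print(join_overlapping_kmers(["AATCAA", "TCAAGG"], shift=2))            # AATCAAGG
```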
default_cls_token = '[CLS]'
default_mask_token = '[MASK]'
default_pad_token = '[PAD]'
default_sep_token = '[SEP]'
default_unk_token = '[UNK]'

encode(segment: str, lca_shift: int = 0, all: bool = False, add_special_tokens: bool = True, **kwargs) -> List[int]

Encode a DNA sequence into its corresponding token IDs using LCA tokenization.

Parameters
  • segment (str) – The DNA segment to encode.

  • lca_shift (int, optional) – Specifies the LCA offset for tokenization. If the required offset is >= shift, an error is raised. Defaults to 0.

  • all (bool, optional) – If True, returns all possible tokenizations. Defaults to False.

  • add_special_tokens (bool, optional) – Whether to add special tokens like [CLS] and [SEP]. Defaults to True.

Returns

A list of encoded token IDs. If ‘all’ is True, a list of lists is returned for each possible tokenization.

Return type

List[int], or List[List[int]] when ‘all’ is True

Usage Example:
>>> tokenizer = ProkBERTTokenizer()
>>> segment = 'AATCAAGGAATTATTATCGTT'
>>> ids = tokenizer.encode(segment)
>>> print(ids)
...
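The encoding step can be sketched end to end with a toy vocabulary. Everything here is an assumption made for illustration: `build_toy_vocab` and `encode_sketch` are hypothetical helpers, the vocabulary layout (special tokens first, then all k-mers in lexicographic order) is invented, and the real ProkBERT IDs will differ. The sketch does follow the documented rule that an LCA offset greater than or equal to the shift is an error.

```python
from itertools import product
from typing import Dict, List

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_toy_vocab(kmer: int) -> Dict[str, int]:
    """Toy vocabulary: special tokens followed by every k-mer over ACGT."""
    tokens = SPECIAL_TOKENS + ["".join(p) for p in product("ACGT", repeat=kmer)]
    return {token: index for index, token in enumerate(tokens)}


def encode_sketch(
    segment: str,
    vocab: Dict[str, int],
    kmer: int,
    shift: int = 1,
    lca_shift: int = 0,
    add_special_tokens: bool = True,
) -> List[int]:
    """Map a segment to IDs: k-mers from `lca_shift`, stepping by `shift`."""
    if not 0 <= lca_shift < shift:
        raise ValueError("lca_shift must satisfy 0 <= lca_shift < shift")
    starts = range(lca_shift, len(segment) - kmer + 1, shift)
    ids = [vocab.get(segment[i:i + kmer], vocab["[UNK]"]) for i in starts]
    if add_special_tokens:
        ids = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]
    return ids


vocab = build_toy_vocab(kmer=2)
print(encode_sketch("AACN", vocab, kmer=2, shift=1))  # [2, 5, 6, 1, 3]
```

In the toy example the k-mer ‘CN’ is not in the vocabulary, so it maps to the [UNK] ID.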

encode_plus(text: str, lca_shift: int = 0, padding_to_max: bool = False, **kwargs) -> Dict[str, Union[Tensor, ndarray]]

Tokenizes a sequence and returns it in a format suitable for model input, including attention masks.

Parameters
  • text (str) – The DNA sequence to tokenize.

  • lca_shift (int, optional) – LCA offset for tokenization. Specifies the offset in the tokenization process, defaults to 0.

  • padding_to_max (bool, optional) – If True, pads the tokenized sequence to the maximum length. Defaults to False.

  • return_tensors (str, optional) – Optional argument to specify the return type (numpy ndarray or PyTorch tensor).

Returns

A dictionary containing ‘input_ids’ and ‘attention_mask’, either as numpy arrays or PyTorch tensors, based on ‘return_tensors’ parameter.

Return type

Dict[str, Union[torch.Tensor, np.ndarray]]

Usage Example:
>>> tokenizer = ProkBERTTokenizer()
>>> segment = 'AATCAAGGAATTATTATCGTT'
>>> encoded = tokenizer.encode_plus(segment)
>>> print(encoded['input_ids'])
...
extended_nucleotide_abc = {'*', 'A', 'C', 'G', 'T'}

classmethod from_pretrained(vocab_file: str) -> ProkBERTTokenizer

Loads a pre-trained tokenizer.

Parameters

vocab_file (str) – Path to the pre-trained tokenizer vocabulary file.

Returns

Loaded tokenizer instance.

Return type

ProkBERTTokenizer

get_positions_tokens(sequence: str, position: int) -> List[str]

Get tokens containing the nucleotide at the given position.

Parameters
  • sequence (str) – The input sequence.

  • position (int) – Position of the character in the sequence.

Returns

List of tokens containing the character at the specified position.

Return type

List[str]

Usage Example:
>>> tokenizer = ProkBERTTokenizer(...)
>>> position = 8
>>> sequence = "AACTGTGATCTGA"
>>> tokens = tokenizer.get_positions_tokens(sequence, position)
>>> print(tokens)
...
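With overlapping k-mers, a single nucleotide is covered by several tokens. The sketch below lists them for a given position; `tokens_covering_position` is a hypothetical helper, and it assumes 0-based positions and k-mers that start at offset 0, neither of which is confirmed by the method description.

```python
from typing import List


def tokens_covering_position(
    sequence: str, position: int, kmer: int = 6, shift: int = 1
) -> List[str]:
    """Return every k-mer window that contains `position` (0-based)."""
    hits = []
    for start in range(0, len(sequence) - kmer + 1, shift):
        if start <= position < start + kmer:
            hits.append(sequence[start:start + kmer])
    return hits


print(tokens_covering_position("AACTGTGATCTGA", position=8, kmer=6, shift=1))
# ['TGTGAT', 'GTGATC', 'TGATCT', 'GATCTG', 'ATCTGA']
```

Away from the sequence ends, a position is covered by up to `kmer` windows when the shift is 1, and by fewer as the shift grows.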

get_vocab() -> Dict[str, int]

Returns the vocabulary dictionary used by the tokenizer.

This method provides access to the tokenizer’s vocabulary, which maps tokens to their corresponding IDs.

Returns

The vocabulary dictionary.

Return type

Dict[str, int]

max_model_input_sizes = {'prokbert-mini-k1s1': 1024, 'prokbert-mini-k6s1': 1024, 'prokbert-mini-k6s2': 2048}
nucleotide_abc = {'A', 'C', 'G', 'T'}
pretrained_init_configuration = {'prokbert-mini-k1s1': {'do_upper_case': True}, 'prokbert-mini-k6s1': {'do_upper_case': True}, 'prokbert-mini-k6s2': {'do_upper_case': True}}
pretrained_vocab_files_map: dict[str, dict[str, str]] = {'vocab_file': {'prokbert-mini-k1s1': 'prokbert-base-dna1/vocab.txt', 'prokbert-mini-k6s1': 'prokbert-base-dna6/vocab.txt', 'prokbert-mini-k6s2': 'prokbert-base-dna6/vocab.txt'}}

save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]

Saves the tokenizer’s vocabulary to a file in the specified directory.

This method writes the vocabulary tokens to a text file, with each token on a new line. The filename can be prefixed with an optional string for clearer identification.

Parameters
  • save_directory (str) – The directory where the vocabulary file will be saved.

  • filename_prefix (Optional[str]) – An optional prefix to the filename of the vocabulary file. Defaults to None, which means no prefix is added.

Returns

A tuple containing the path to the saved vocabulary file.

Return type

Tuple[str]

Usage Example:
>>> tokenizer = ProkBERTTokenizer()
>>> saved_path = tokenizer.save_vocabulary("/path/to/save", filename_prefix="prokbert_")
>>> print(saved_path)
...
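The file format described above (one token per line) can be sketched as follows. `save_vocab_sketch` is a hypothetical stand-in, not the method itself; writing tokens in order of their IDs and joining the prefix directly to ‘vocab.txt’ are assumptions.

```python
import os
import tempfile
from typing import Dict, Optional, Tuple


def save_vocab_sketch(
    vocab: Dict[str, int], save_directory: str, filename_prefix: Optional[str] = None
) -> Tuple[str]:
    """Write tokens to <prefix>vocab.txt, one per line, ordered by ID."""
    os.makedirs(save_directory, exist_ok=True)
    filename = (filename_prefix or "") + "vocab.txt"
    path = os.path.join(save_directory, filename)
    with open(path, "w", encoding="utf-8") as handle:
        for token, _index in sorted(vocab.items(), key=lambda item: item[1]):
            handle.write(token + "\n")
    return (path,)


# Write a tiny toy vocabulary to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    (saved,) = save_vocab_sketch({"[PAD]": 0, "[UNK]": 1, "A": 2}, tmp, "prokbert_")
    print(os.path.basename(saved))  # prokbert_vocab.txt
```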
sequence_unk_token = 'N'

tokenize(text: str, lca_shift: int = 0, all: bool = False) -> Union[List[str], Tuple[List[List[str]], List[List[str]]]]

Tokenizes a given DNA segment using the Local Context Aware (LCA) approach. Depending on the parameters, it can return either a single tokenized sequence or all possible tokenizations.

Parameters
  • text (str) – The DNA segment to be tokenized. Represents a sequence of nucleotides.

  • lca_shift (int, optional) – Specifies the LCA offset for tokenization. Determines which specific tokenized vector to return. A value of 0 returns the first vector, while higher values return subsequent vectors. Defaults to 0.

  • all (bool, optional) – If set to True, the method returns all possible tokenizations for the given segment. When False, it returns only the tokenized vector corresponding to the specified LCA offset.

Returns

If ‘all’ is False, returns a list of tokenized segments corresponding to the specified LCA shift. If ‘all’ is True, returns a tuple containing two lists: one with all possible tokenized segments, and the other with k-merized segments.

Return type

Union[List[str], Tuple[List[List[str]], List[List[str]]]]

Usage Example:
>>> tokenizer = ProkBERTTokenizer(tokenization_params={'kmer': 6, 'shift': 1}, operation_space='sequence')
>>> segment = 'AATCAAGGAATTATTATCGTT'
>>> tokens, kmers = tokenizer.tokenize(segment, all=True)
>>> print(tokens)
...
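When the shift is larger than 1, a segment has one tokenization per LCA offset. The sketch below enumerates them in plain Python; `all_lca_tokenizations` is a hypothetical helper that only produces the k-mer windows and does not reproduce the two-element tuple returned by the method when all=True. It assumes that offsets run from 0 to shift - 1.

```python
from typing import List


def all_lca_tokenizations(segment: str, kmer: int, shift: int) -> List[List[str]]:
    """One k-mer list per LCA offset in 0 .. shift - 1."""
    if kmer < 1 or shift < 1:
        raise ValueError("kmer and shift must be positive integers")
    return [
        [segment[i:i + kmer] for i in range(offset, len(segment) - kmer + 1, shift)]
        for offset in range(shift)
    ]


for offset, tokens in enumerate(all_lca_tokenizations("AATCAAGG", kmer=6, shift=2)):
    print(offset, tokens)
# 0 ['AATCAA', 'TCAAGG']
# 1 ['ATCAAG']
```

With a shift of 1 there is only a single offset, so the list contains exactly one tokenization.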
vocab_files_names: dict[str, str] = {'vocab_file': 'vocab.txt'}

prokbert.prokbert_tokenizer.load_vocab(vocab_file)

Loads a vocabulary file into a dictionary.
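A loader of this kind typically reads one token per line and uses the line index as the token ID, following the common BERT vocab.txt convention. The sketch below assumes that convention; `load_vocab_sketch` is a hypothetical stand-in and the actual function may differ.

```python
import os
import tempfile
from collections import OrderedDict
from typing import Dict


def load_vocab_sketch(vocab_file: str) -> Dict[str, int]:
    """Read a one-token-per-line file into a token -> ID mapping."""
    vocab = OrderedDict()
    with open(vocab_file, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            vocab[line.rstrip("\n")] = index  # line number doubles as the ID
    return vocab


# Round trip through a temporary toy vocabulary file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("[PAD]\n[UNK]\nA\nC\n")
    print(dict(load_vocab_sketch(path)))
    # {'[PAD]': 0, '[UNK]': 1, 'A': 2, 'C': 3}
```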

The ProkBERTTokenizer class inherits from the standard tokenizer classes and includes additional methods specific to ProkBERT’s requirements.

Additionally, below are more detailed listings and descriptions of the individual methods within the ProkBERTTokenizer class: