ProkBERT Tokenizer

These utilities contain the tools needed by the ProkBERT tokenizer.

ProkBERTTokenizer

prokbert.prokbert_tokenizer.load_vocab(...)

Loads a vocabulary file into a dictionary.
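As a minimal sketch of what loading a vocabulary into a dictionary can look like, the snippet below assumes a plain-text file with one token per line, where the line index serves as the token ID. The file layout, the special-token order, and the helper name `load_vocab_sketch` are assumptions made for illustration; they are not taken from the library.

```python
import os
import tempfile
from typing import Dict


def load_vocab_sketch(vocab_file: str) -> Dict[str, int]:
    """Map each token to its line index (assumed layout: one token per line)."""
    vocab: Dict[str, int] = {}
    with open(vocab_file, "r", encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            token = line.rstrip("\n")
            if token:  # skip blank lines
                vocab[token] = index
    return vocab


# Write a tiny, hypothetical vocabulary to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("[PAD]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\nAAAAAA\nAAAAAC\n")
    toy_vocab = load_vocab_sketch(path)

print(toy_vocab)
```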

`prokbert.prokbert_tokenizer.ProkBERTTokenizer`([...])	Custom tokenizer for ProkBERT.
`prokbert.prokbert_tokenizer.ProkBERTTokenizer.tokenize`(text)	Tokenizes a given segment.
`prokbert.prokbert_tokenizer.ProkBERTTokenizer.depr_convert_ids_to_tokens`(ids)	Converts token IDs to their corresponding tokens (deprecated variant).
`prokbert.prokbert_tokenizer.ProkBERTTokenizer.convert_ids_to_tokens`(ids)	Converts token IDs back to their original tokens.
`prokbert.prokbert_tokenizer.ProkBERTTokenizer.save_vocabulary`(...)	Saves the vocabulary to a file.
`prokbert.prokbert_tokenizer.ProkBERTTokenizer.from_pretrained`(...)	Loads a pre-trained tokenizer.
`prokbert.prokbert_tokenizer.ProkBERTTokenizer.encode_plus`(text)	Tokenizes a sequence and returns it in a format suitable for model input.
`prokbert.prokbert_tokenizer.ProkBERTTokenizer.encode`(segment)	Encode a DNA sequence into its corresponding token IDs.
`prokbert.prokbert_tokenizer.ProkBERTTokenizer.decode`(ids)	Converts a sequence of IDs into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.
`prokbert.prokbert_tokenizer.ProkBERTTokenizer.batch_decode`(...)	Decodes multiple token ID sequences back into their original sequences.
`prokbert.prokbert_tokenizer.ProkBERTTokenizer.get_positions_tokens`(...)	Get tokens containing the nucleotide at the given position.

class prokbert.prokbert_tokenizer.ProkBERTTokenizer(tokenization_params: Dict = {}, segmentation_params: Dict = {}, comp_params: Dict = {}, operation_space: str = 'sequence', **kwargs)

Bases: PreTrainedTokenizer

Custom tokenizer for ProkBERT.

batch_decode(token_ids_list: List[List[int]], **kwargs) → List[str]

Decodes multiple token ID sequences back into their original sequences.

Args:

token_ids_list (List[List[int]]): List of token ID sequences.

Returns:

List[str]: List of decoded sequences.

Usage Example:

>>> tokenizer = ProkBERTTokenizer(...)
>>> ids = [[2, 213, 3343, 165, 2580, 248, 3905, 978, 3296, 3]]
>>> sequences = tokenizer.batch_decode(ids)
>>> print(sequences)
...

batch_encode_plus(sequences: List[str], lca_shift: int = 0, all: bool = False, **kwargs) → Dict[str, List[List[int]]]

Tokenizes multiple sequences and returns them in a format suitable for model input. It is assumed that sequences have already been preprocessed (i.e., segmented) and quality controlled.

Args:

sequences (List[str]): A list of DNA sequences to be tokenized.
lca_shift (int, default=0): The LCA offset (window) for which the tokenized vector is returned. If the requested offset is >= shift, an error is raised.
all (bool, default=False): Whether all possible tokenization vectors should be returned. If False, only the specified offset is used.
**kwargs: Additional arguments (such as max_length and padding).

Returns:

Dict[str, List[List[int]]]: A dictionary containing token IDs, attention masks, and token type IDs.
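To illustrate the shape of such a return value, the following sketch pads several ID sequences to a common length and builds the matching attention masks and token type IDs. The key names follow the usual Hugging Face convention, and the padding ID of 0 and the helper name `pad_batch_sketch` are assumptions; the real method may pad or truncate differently.

```python
from typing import Dict, List


def pad_batch_sketch(batch_ids: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad ID sequences to a common length and build the matching masks."""
    max_len = max((len(ids) for ids in batch_ids), default=0)
    encoded: Dict[str, List[List[int]]] = {
        "input_ids": [],
        "attention_mask": [],
        "token_type_ids": [],
    }
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        encoded["input_ids"].append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        encoded["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Single-segment input, so every position gets segment ID 0.
        encoded["token_type_ids"].append([0] * max_len)
    return encoded


# Two already-encoded sequences of different lengths (IDs are illustrative).
batch = pad_batch_sketch([[2, 213, 3343, 3], [2, 165, 3]])
print(batch["input_ids"])       # [[2, 213, 3343, 3], [2, 165, 3, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```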

convert_ids_to_tokens(ids: Union[int, List[int]]) → Union[str, List[str]]

Converts token IDs back to their original tokens.

Args:

ids (Union[int, List[int]]): A token ID or a list of token IDs to convert.

Returns:

Union[str, List[str]]: The corresponding token or list of tokens.

Usage Example:

>>> tokenizer = ProkBERTTokenizer(...)
>>> ids = [213, 3343]
>>> tokens = tokenizer.convert_ids_to_tokens(ids)
>>> print(tokens)
...

decode(ids)

Converts a sequence of IDs into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.

Similar to calling self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids)).

Args:

token_ids (Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]): List of tokenized input ids. Can be obtained using the __call__ method.
skip_special_tokens (bool, optional, defaults to False): Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (bool, optional): Whether or not to clean up the tokenization spaces. If None, will default to self.clean_up_tokenization_spaces.
kwargs (additional keyword arguments, optional): Will be passed to the underlying model specific decode method.

Returns:

str: The decoded sentence.
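For overlapping k-mer tokens, one plausible way to turn a token list back into a nucleotide string is to keep the first k-mer whole and append only the last `shift` characters of each following k-mer. The sketch below shows that idea under the assumptions that the shift is 2 and that special tokens are dropped first; the helper name `join_kmers_sketch` and the special-token set are hypothetical, and the library's own decoding logic may differ.

```python
from typing import List

# Assumed special tokens; the real vocabulary may define others.
SPECIAL_TOKENS = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}


def join_kmers_sketch(tokens: List[str], shift: int = 2) -> str:
    """Rebuild a sequence from overlapping k-mers, ignoring special tokens."""
    kmers = [token for token in tokens if token not in SPECIAL_TOKENS]
    if not kmers:
        return ""
    # Each later k-mer contributes only its last `shift` characters,
    # because the rest overlaps with the previous k-mer.
    return kmers[0] + "".join(kmer[-shift:] for kmer in kmers[1:])


tokens = ["[CLS]", "AATCAA", "TCAAGG", "AAGGAA", "GGAATT", "[SEP]"]
decoded = join_kmers_sketch(tokens)
print(decoded)  # AATCAAGGAATT
```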

depr_convert_ids_to_tokens(ids: Union[int, List[int]]) → List[str]

Converts token IDs to their corresponding tokens. Deprecated variant of convert_ids_to_tokens.

Args:

ids (Union[int, List[int]]): A token ID or a list of token IDs to convert.

Returns:

List[str]: List of corresponding tokens.

Usage Example:

>>> tokenizer = ProkBERTTokenizer(...)
>>> ids = [213, 3343]
>>> tokens = tokenizer.depr_convert_ids_to_tokens(ids)
>>> print(tokens)
...

encode(segment: str, lca_shift: int = 0, all: bool = False, add_special_tokens: bool = True, **kwargs) → List[int]

Encode a DNA sequence into its corresponding token IDs.

Args:

segment (str): The DNA segment to encode.
add_special_tokens (bool, optional): Whether to add special tokens like [CLS] and [SEP]. Defaults to True.

Returns:

List[int]: Encoded token IDs.

Usage Example:

>>> tokenizer = ProkBERTTokenizer(...)
>>> segment = 'AATCAAGGAATTATTATCGTT'
>>> ids = tokenizer.encode(segment)
>>> print(ids)
...
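The effect of add_special_tokens can be sketched with a toy vocabulary: tokens are looked up in a dictionary, unknown tokens fall back to the [UNK] ID, and the [CLS] and [SEP] IDs are placed around the result. The vocabulary contents, the ID values, and the helper name `encode_tokens_sketch` are assumptions for illustration only.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; real IDs come from the tokenizer's vocab file.
TOY_VOCAB: Dict[str, int] = {
    "[PAD]": 0,
    "[UNK]": 1,
    "[CLS]": 2,
    "[SEP]": 3,
    "AATCAA": 4,
    "TCAAGG": 5,
}


def encode_tokens_sketch(tokens: List[str], vocab: Dict[str, int],
                         add_special_tokens: bool = True) -> List[int]:
    """Look up token IDs, optionally wrapping them in [CLS] ... [SEP]."""
    unk_id = vocab["[UNK]"]
    ids = [vocab.get(token, unk_id) for token in tokens]
    if add_special_tokens:
        ids = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]
    return ids


with_special = encode_tokens_sketch(["AATCAA", "TCAAGG", "NNNNNN"], TOY_VOCAB)
without_special = encode_tokens_sketch(["AATCAA", "TCAAGG"], TOY_VOCAB,
                                       add_special_tokens=False)
print(with_special)     # [2, 4, 5, 1, 3]
print(without_special)  # [4, 5]
```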

encode_plus(text: str, lca_shift: int = 0, **kwargs) → Dict[str, ndarray]

Tokenizes a sequence and returns it in a format suitable for model input.

Args:

text (str): The sequence to tokenize.
lca_shift (int, optional): LCA offset for tokenization. Defaults to 0.

Returns:

Dict[str, np.ndarray]: Dictionary containing token IDs and attention masks.

Usage Example:

>>> tokenizer = ProkBERTTokenizer(...)
>>> segment = 'AATCAAGGAATTATTATCGTT'
>>> encoded = tokenizer.encode_plus(segment)
>>> print(encoded)
...

classmethod from_pretrained(vocab_file: str) → ProkBERTTokenizer

Loads a pre-trained tokenizer.

Args:

vocab_file (str): Path to the pre-trained tokenizer vocabulary file.

Returns:

ProkBERTTokenizer: Loaded tokenizer instance.

get_positions_tokens(sequence: str, position: int) → List[str]

Get tokens containing the nucleotide at the given position.

Args:

sequence (str): The sequence to search.
position (int): Position of the nucleotide in the sequence.

Returns:

List[str]: List of tokens containing the character at the specified position.

Usage Example:

>>> tokenizer = ProkBERTTokenizer(...)
>>> position = 8
>>> sequence = "AACTGTGATCTGA"
>>> tokens = tokenizer.get_positions_tokens(sequence, position)
>>> print(tokens)
...
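A rough sketch of the underlying idea: a k-mer starting at index `start` covers the positions `start` through `start + kmer - 1`, so the tokens containing a given nucleotide are those whose window includes its position. The k-mer size of 6, the shift of 2, the 0-based position, and the helper name `tokens_covering_position_sketch` are all assumptions here; the actual method may use other parameters or consider every offset.

```python
from typing import List


def tokens_covering_position_sketch(sequence: str, position: int,
                                    kmer: int = 6, shift: int = 2,
                                    offset: int = 0) -> List[str]:
    """Return the k-mers whose window contains the 0-based `position`."""
    if not 0 <= position < len(sequence):
        raise IndexError("position is outside the sequence")
    covering = []
    for start in range(offset, len(sequence) - kmer + 1, shift):
        if start <= position < start + kmer:
            covering.append(sequence[start:start + kmer])
    return covering


covering_tokens = tokens_covering_position_sketch("AACTGTGATCTGA", 8)
print(covering_tokens)  # ['GTGATC', 'GATCTG']
```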

save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None) → Tuple[str]

Saves the vocabulary to a file.

tokenize(text: str, lca_shift: int = 0, all: bool = False) → Union[List[str], Tuple[List[List[str]], List[List[str]]]]

Tokenizes a given segment.

Args:

text (str): The DNA segment to tokenize.
lca_shift (int, optional): The LCA offset whose tokenized vector is returned. Defaults to 0.
all (bool, optional): If True, returns all possible tokenizations. Defaults to False.

Returns:

Union[List[str], Tuple[List[List[str]], List[List[str]]]]: Tokenized segment or tuple of all possible tokenizations.

Usage Example:

>>> tokenizer = ProkBERTTokenizer(...)
>>> segment = 'AATCAAGGAATTATTATCGTT'
>>> tokens, kmers = tokenizer.tokenize(segment, all=True)
>>> print(tokens)
...
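The roles of lca_shift and all can be illustrated with a plain overlapping k-mer split. The sketch below assumes a k-mer size of 6 and a shift of 2 (consistent with the tokens 'AATCAA' and 'TCAAGG' used elsewhere on this page); each offset from 0 to shift - 1 yields a different tokenization of the same segment. The helper names are hypothetical and the real tokenizer reads these parameters from tokenization_params.

```python
from typing import List


def kmer_tokenize_sketch(segment: str, kmer: int = 6, shift: int = 2,
                         lca_shift: int = 0) -> List[str]:
    """Split `segment` into overlapping k-mers, starting at offset `lca_shift`."""
    if not 0 <= lca_shift < shift:
        raise ValueError("lca_shift must satisfy 0 <= lca_shift < shift")
    last_start = len(segment) - kmer
    return [segment[i:i + kmer] for i in range(lca_shift, last_start + 1, shift)]


def kmer_tokenize_all_sketch(segment: str, kmer: int = 6,
                             shift: int = 2) -> List[List[str]]:
    """Return one tokenization per possible offset (0 .. shift - 1)."""
    return [kmer_tokenize_sketch(segment, kmer, shift, offset)
            for offset in range(shift)]


segment = "AATCAAGGAATTATTATCGTT"
offset0_tokens = kmer_tokenize_sketch(segment)
all_tokenizations = kmer_tokenize_all_sketch(segment)
print(offset0_tokens[:3])        # ['AATCAA', 'TCAAGG', 'AAGGAA']
print(all_tokenizations[1][:3])  # ['ATCAAG', 'CAAGGA', 'AGGAAT']
```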