ProkBERT Tokenizer
These utilities contain the tools needed by the ProkBERT tokenizer.
ProkBERTTokenizer
Loads a vocabulary file into a dictionary.
Custom tokenizer for ProkBERT.
Tokenizes a given segment.
Converts tokens to their corresponding IDs.
Converts token IDs back to their original tokens.
Saves the vocabulary to a file.
Loads a pre-trained tokenizer.
Tokenizes a sequence and returns it in a format suitable for model input.
Encode a DNA sequence into its corresponding token IDs.
Converts a sequence of ids into a string, using the tokenizer and vocabulary with options to remove special tokens and clean up tokenization spaces.
Decodes multiple token ID sequences back into their original sequences.
Get tokens containing the nucleotide at the given position.
- class prokbert.prokbert_tokenizer.ProkBERTTokenizer(tokenization_params: Dict = {}, segmentation_params: Dict = {}, comp_params: Dict = {}, operation_space: str = 'sequence', **kwargs)
Bases: PreTrainedTokenizer
Custom tokenizer for ProkBERT.
- batch_decode(token_ids_list: List[List[int]], **kwargs) -> List[str]
Decodes multiple token ID sequences back into their original sequences.
- Args:
token_ids_list (List[List[int]]): List of token ID sequences.
- Returns:
List[str]: List of decoded sequences.
- Usage Example:
>>> tokenizer = ProkBERTTokenizer(...)
>>> ids = [[2, 213, 3343, 165, 2580, 248, 3905, 978, 3296, 3]]
>>> sequences = tokenizer.batch_decode(ids)
>>> print(sequences)
...
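The idea behind decoding overlapping k-mers can be illustrated without the library. The sketch below assumes tokens of length 6 produced with a shift of 2, which is consistent with the tokens 'AATCAA' and 'TCAAGG' used in the examples on this page. `join_kmers` and `batch_join` are hypothetical helpers, not part of ProkBERT, and the actual implementation may differ.

```python
from typing import List


def join_kmers(tokens: List[str], shift: int = 2) -> str:
    """Rebuild a sequence from overlapping k-mers produced with a fixed shift (assumed)."""
    if not tokens:
        return ""
    # The first k-mer is kept whole; every later k-mer contributes only its last `shift` bases.
    return tokens[0] + "".join(tok[-shift:] for tok in tokens[1:])


def batch_join(batch: List[List[str]], shift: int = 2) -> List[str]:
    """Apply join_kmers to every token list in a batch."""
    return [join_kmers(tokens, shift) for tokens in batch]


print(batch_join([["AATCAA", "TCAAGG", "AAGGAA"]]))  # ['AATCAAGGAA']
```

Special tokens would have to be removed before joining; the sketch leaves that step out.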
- batch_encode_plus(sequences: List[str], lca_shift: int = 0, all: bool = False, **kwargs) -> Dict[str, List[List[int]]]
Tokenizes multiple sequences and returns them in a format suitable for model input. The sequences are assumed to have already been preprocessed (i.e., segmented) and quality controlled.
- Args:
sequences (List[str]): A list of DNA sequences to be tokenized.
lca_shift (int, default=0): The LCA offset (window) that selects which tokenized vector is returned. If the requested offset is >= shift, an error is raised.
all (bool, default=False): Whether all possible tokenization vectors should be returned. If False, only the specified offset is used.
**kwargs: Additional arguments (such as max_length, padding, etc.).
- Returns:
Dict[str, List[List[int]]]: A dictionary containing token IDs, attention masks, and token type IDs.
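To make the shape of the returned dictionary concrete, here is a minimal sketch of how variable-length ID lists can be padded into a batch. It is not the ProkBERT implementation: `pad_batch` is a hypothetical helper, the padding ID of 0 is an assumption, and the key names follow the usual Hugging Face convention.

```python
from typing import Dict, List


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad ID lists to a common length and build the matching masks."""
    max_len = max((len(ids) for ids in batch), default=0)
    input_ids, attention_mask, token_type_ids = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # pad_id is an assumed value
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
        token_type_ids.append([0] * max_len)                 # single-segment input
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


encoded = pad_batch([[2, 213, 3343, 3], [2, 165, 3]])
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```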
- convert_ids_to_tokens(ids: Union[int, List[int]]) -> Union[str, List[str]]
Converts token IDs back to their original tokens.
- Args:
ids (Union[int, List[int]]): Token ID or list of token IDs to convert.
- Returns:
Union[str, List[str]]: The corresponding token or list of tokens.
- Usage Example:
>>> tokenizer = ProkBERTTokenizer(...)
>>> ids = [213, 3343]
>>> tokens = tokenizer.convert_ids_to_tokens(ids)
>>> print(tokens)
...
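Conceptually, this lookup is the inverse of the token-to-ID vocabulary. The following sketch uses a made-up toy vocabulary (the IDs are illustrative and do not correspond to the real ProkBERT vocabulary), and `ids_to_tokens` is a hypothetical stand-in for the method.

```python
from typing import Dict, List, Union

# Toy vocabulary with invented IDs, for illustration only.
toy_vocab: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "AATCAA": 4, "TCAAGG": 5,
}
# Invert the mapping once so that lookups by ID are cheap.
id_to_token: Dict[int, str] = {idx: tok for tok, idx in toy_vocab.items()}


def ids_to_tokens(ids: Union[int, List[int]]) -> Union[str, List[str]]:
    """Map a single ID or a list of IDs to tokens; unknown IDs become [UNK]."""
    if isinstance(ids, int):
        return id_to_token.get(ids, "[UNK]")
    return [id_to_token.get(i, "[UNK]") for i in ids]


print(ids_to_tokens([4, 5]))  # ['AATCAA', 'TCAAGG']
print(ids_to_tokens(2))       # [CLS]
```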
- decode(ids)
Converts a sequence of ids into a string, using the tokenizer and vocabulary with options to remove special tokens and clean up tokenization spaces.
Similar to doing self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids)).
- Args:
- token_ids (Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]):
List of tokenized input ids. Can be obtained using the __call__ method.
- skip_special_tokens (bool, optional, defaults to False):
Whether or not to remove special tokens in the decoding.
- clean_up_tokenization_spaces (bool, optional):
Whether or not to clean up the tokenization spaces. If None, will default to self.clean_up_tokenization_spaces.
- kwargs (additional keyword arguments, optional):
Will be passed to the underlying model specific decode method.
- Returns:
str: The decoded sentence.
- depr_convert_ids_to_tokens(ids: Union[int, List[int]]) -> List[str]
Converts tokens to their corresponding IDs.
- Args:
tokens (List[str]): List of tokens to convert.
- Returns:
List[int]: List of corresponding token IDs.
- Usage Example:
>>> tokenizer = ProkBERTTokenizer(...)
>>> tokens = ['AATCAA', 'TCAAGG']
>>> ids = tokenizer.convert_tokens_to_ids(tokens)
>>> print(ids)
...
- encode(segment: str, lca_shift: int = 0, all: bool = False, add_special_tokens: bool = True, **kwargs) -> List[int]
Encodes a DNA sequence into its corresponding token IDs.
- Args:
segment (str): The DNA segment to encode.
add_special_tokens (bool, optional): Whether to add special tokens like [CLS] and [SEP]. Defaults to True.
- Returns:
List[int]: Encoded token IDs.
- Usage Example:
>>> tokenizer = ProkBERTTokenizer(...)
>>> segment = 'AATCAAGGAATTATTATCGTT'
>>> ids = tokenizer.encode(segment)
>>> print(ids)
...
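Encoding can be thought of as tokenization, vocabulary lookup, and optional wrapping with special tokens. The sketch below is a library-free illustration under stated assumptions: k-mers of length 6 with a shift of 2, and a toy vocabulary whose special-token order is chosen so that [CLS] and [SEP] receive IDs 2 and 3, as the first and last IDs in the batch_decode example suggest. `build_toy_vocab` and `encode_sketch` are hypothetical names, and the k-mer IDs will not match the real vocabulary.

```python
from itertools import product
from typing import Dict, List

SPECIALS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]  # assumed order of special tokens


def build_toy_vocab(k: int = 6) -> Dict[str, int]:
    """Enumerate all DNA k-mers after the special tokens (toy numbering)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {tok: idx for idx, tok in enumerate(SPECIALS + kmers)}


def encode_sketch(segment: str, vocab: Dict[str, int], k: int = 6, shift: int = 2,
                  add_special_tokens: bool = True) -> List[int]:
    """Split into overlapping k-mers, look up IDs, optionally add [CLS]/[SEP]."""
    tokens = [segment[i:i + k] for i in range(0, len(segment) - k + 1, shift)]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    if add_special_tokens:
        ids = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]
    return ids


vocab = build_toy_vocab()
ids = encode_sketch("AATCAAGGAATTATTATCGTT", vocab)
print(len(ids), ids[0], ids[-1])  # 10 2 3
```

Under these assumptions the 21-nucleotide segment yields 8 k-mers plus 2 special tokens, which matches the length of the ID list shown in the batch_decode example.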
- encode_plus(text: str, lca_shift: int = 0, **kwargs) -> Dict[str, ndarray]
Tokenizes a sequence and returns it in a format suitable for model input.
- Args:
text (str): The sequence to tokenize.
lca_shift (int, optional): LCA offset for tokenization. Defaults to 0.
- Returns:
Dict[str, np.ndarray]: Dictionary containing token IDs and attention masks.
- Usage Example:
>>> tokenizer = ProkBERTTokenizer(...)
>>> segment = 'AATCAAGGAATTATTATCGTT'
>>> encoded = tokenizer.encode_plus(segment)
>>> print(encoded)
...
- classmethod from_pretrained(vocab_file: str) -> ProkBERTTokenizer
Loads a pre-trained tokenizer.
- Args:
vocab_file (str): Path to the pre-trained tokenizer vocabulary file.
- Returns:
ProkBERTTokenizer: Loaded tokenizer instance.
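Loading a vocabulary file into a dictionary can be sketched as follows. The file layout (one token per line, with the line index used as the token ID) follows the common BERT convention and is an assumption here, as is the helper name `load_vocab_sketch`; the actual ProkBERT vocabulary format may differ.

```python
import os
import tempfile
from typing import Dict


def load_vocab_sketch(vocab_file: str) -> Dict[str, int]:
    """Read one token per line and use the line index as the token ID (assumed format)."""
    vocab: Dict[str, int] = {}
    with open(vocab_file, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            vocab[line.rstrip("\n")] = index
    return vocab


# Write a tiny vocabulary to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "AATCAA"]) + "\n")
    vocab = load_vocab_sketch(path)

print(vocab["AATCAA"])  # 4
```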
- get_positions_tokens(sequence: str, position: int) -> List[str]
Gets the tokens containing the nucleotide at the given position.
- Args:
sequence (str): Sequence.
position (int): Position of the character.
- Returns:
List[str]: List of tokens containing the character at the specified position.
- Usage Example:
>>> tokenizer = ProkBERTTokenizer(...)
>>> position = 8
>>> sequence = "AACTGTGATCTGA"
>>> tokens = tokenizer.get_positions_tokens(sequence, position)
>>> print(tokens)
...
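As a rough illustration of what "tokens containing a position" means, the sketch below lists every k-mer window that covers a given index. It assumes k = 6 and considers every possible start position; the real method may restrict the result to the windows produced by the configured shift and offsets. `kmers_covering` is a hypothetical helper.

```python
from typing import List


def kmers_covering(sequence: str, position: int, k: int = 6) -> List[str]:
    """Return all k-mers of `sequence` whose window includes index `position`."""
    if not 0 <= position < len(sequence):
        raise ValueError("position is outside the sequence")
    first = max(0, position - k + 1)         # earliest start that still reaches `position`
    last = min(position, len(sequence) - k)  # latest start that fits inside the sequence
    return [sequence[start:start + k] for start in range(first, last + 1)]


print(kmers_covering("AACTGTGATCTGA", 8))
# ['TGTGAT', 'GTGATC', 'TGATCT', 'GATCTG', 'ATCTGA']
```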
- save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]
Saves the vocabulary to a file.
- tokenize(text: str, lca_shift: int = 0, all: bool = False) -> Union[List[str], Tuple[List[List[str]], List[List[str]]]]
Tokenizes a given segment.
- Args:
text (str): The DNA segment to tokenize.
lca_shift (int, optional): The LCA offset whose tokenized vector should be returned. Defaults to 0.
all (bool, optional): If True, returns all possible tokenizations. Defaults to False.
- Returns:
Union[List[str], Tuple[List[List[str]], List[List[str]]]]: Tokenized segment or tuple of all possible tokenizations.
- Usage Example:
>>> tokenizer = ProkBERTTokenizer(...)
>>> segment = 'AATCAAGGAATTATTATCGTT'
>>> tokens, kmers = tokenizer.tokenize(segment, all=True)
>>> print(tokens)
...
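The role of the offset can be shown with a small, library-free sketch of overlapping k-mer tokenization. It assumes k = 6 and shift = 2 (consistent with the tokens 'AATCAA' and 'TCAAGG' used in the examples on this page) and rejects offsets that are not smaller than the shift, mirroring the error condition described for batch_encode_plus. `kmer_tokenize` and `kmer_tokenize_all` are hypothetical names, not the ProkBERT API.

```python
from typing import List


def kmer_tokenize(segment: str, k: int = 6, shift: int = 2, offset: int = 0) -> List[str]:
    """Split a segment into k-mers starting at `offset` and stepping by `shift`."""
    if not 0 <= offset < shift:
        raise ValueError("offset must satisfy 0 <= offset < shift")
    return [segment[i:i + k] for i in range(offset, len(segment) - k + 1, shift)]


def kmer_tokenize_all(segment: str, k: int = 6, shift: int = 2) -> List[List[str]]:
    """Return one tokenization per possible offset (0 .. shift - 1)."""
    return [kmer_tokenize(segment, k, shift, off) for off in range(shift)]


segment = "AATCAAGGAATTATTATCGTT"
print(kmer_tokenize(segment)[:2])            # ['AATCAA', 'TCAAGG']
print(kmer_tokenize(segment, offset=1)[:1])  # ['ATCAAG']
print(len(kmer_tokenize_all(segment)))       # 2
```

With a shift of 2 there are two distinct tokenizations of the same segment, one per offset, which is what requesting all tokenizations would cover in this sketch.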