prokbert.prokbert_tokenizer.ProkBERTTokenizer

class prokbert.prokbert_tokenizer.ProkBERTTokenizer(tokenization_params: Dict = {}, segmentation_params: Dict = {}, comp_params: Dict = {}, operation_space: str = 'sequence', **kwargs)

Custom tokenizer for ProkBERT. It handles the tokenization steps specific to ProkBERT: Local Context Aware (LCA) tokenization and sequence segmentation. LCA tokenization splits a sequence into overlapping k-mers so that each token carries information about its local context, which is intended to improve model generalization and performance. The key parameters are the k-mer size and the shift between consecutive k-mers. For instance, a k-mer size of 6 with a shift of 1 yields heavily overlapping tokens that capture detailed sequence information, while a k-mer size of 1 reduces to basic character-based tokenization.
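The effect of the k-mer size and shift can be illustrated with a minimal, library-independent sketch. The helper `kmer_windows` below is hypothetical and is not part of ProkBERT; it assumes a plain sliding window and ignores vocabulary lookup, special tokens, and unknown-token handling.

```python
from typing import List


def kmer_windows(sequence: str, kmer: int = 6, shift: int = 1, offset: int = 0) -> List[str]:
    """Return the overlapping k-mers of `sequence`.

    Hypothetical helper for illustration only: windows start at `offset`
    and advance by `shift` characters; incomplete trailing windows are dropped.
    """
    if kmer < 1 or shift < 1:
        raise ValueError("kmer and shift must be positive integers")
    last_start = len(sequence) - kmer
    return [sequence[i:i + kmer] for i in range(offset, last_start + 1, shift)]


# k-mer size 6, shift 1: two overlapping windows for a 7-nucleotide sequence
print(kmer_windows("ATTCTTT", kmer=6, shift=1))  # ['ATTCTT', 'TTCTTT']

# k-mer size 1: character-level split
print(kmer_windows("ATTCTTT", kmer=1, shift=1))
```

With a shift of 1, neighbouring tokens share `kmer - 1` characters, which is where the local context comes from.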

Parameters

tokenization_params (dict) – Parameters for tokenization, derived from the ‘tokenization’ part of the config. Expected keys include ‘type’, ‘kmer’, ‘shift’, etc. See below for detailed descriptions.
segmentation_params (dict) – Parameters for segmentation, derived from the ‘segmentation’ part of the config. Includes ‘type’, ‘min_length’, ‘max_length’, etc.
comp_params (dict) – Computation parameters from the ‘computation’ part of the config, like CPU cores and batch sizes.
operation_space (str) – Defines the operation space (‘sequence’ or ‘kmer’).

Tokenization Parameters:

type (str): Tokenization approach, default ‘lca’ for Local Context Aware.
kmer (int): k-mer size for tokenization.
shift (int): Shift between the start positions of consecutive k-mers.
max_segment_length (int): Maximum number of characters in a segment.
token_limit (int): Maximum token count for language model processing.
max_unknown_token_proportion (float): Maximum allowed proportion of unknown tokens.
vocabfile (str): Path to the vocabulary file.
isPaddingToMaxLength (bool): Whether to pad sentences to a fixed length.
add_special_token (bool): Whether to add special tokens like [CLS], [SEP].
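The interplay between `kmer`, `shift`, `max_segment_length`, and `token_limit` can be estimated with simple sliding-window arithmetic. The sketch below is an assumption-laden illustration, not ProkBERT code: both function names are hypothetical, the value 512 is an arbitrary example limit, and two special tokens ([CLS] and [SEP]) are assumed to count against the limit.

```python
def window_count(segment_length: int, kmer: int, shift: int) -> int:
    """Number of complete k-mer windows in a segment (plain sliding window)."""
    if segment_length < kmer:
        return 0
    return (segment_length - kmer) // shift + 1


def longest_segment_within_limit(token_limit: int, kmer: int, shift: int,
                                 special_tokens: int = 2) -> int:
    """Longest segment (in characters) whose window count plus special tokens
    stays within `token_limit`. Assumes `special_tokens` extra tokens are added."""
    usable = token_limit - special_tokens
    if usable < 1:
        return 0
    # Largest L with floor((L - kmer) / shift) + 1 <= usable
    return usable * shift + kmer - 1


# Example with an arbitrary limit of 512 tokens
for shift in (1, 2):
    length = longest_segment_within_limit(512, kmer=6, shift=shift)
    print(shift, length, window_count(length, kmer=6, shift=shift))
```

A larger shift lets the same token budget cover a longer stretch of sequence, at the cost of less overlap between neighbouring tokens.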

Segmentation Parameters:

type (str): Segmentation type, ‘contiguous’ or ‘random’.
min_length (int): Minimum length for a segment.
max_length (int): Maximum length for a segment.
coverage (float): Expected average coverage of positions in the sequence.
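As a rough illustration of the ‘contiguous’ segmentation type, the sketch below cuts a sequence into consecutive, non-overlapping pieces of at most `max_length` characters and discards pieces shorter than `min_length`. The function `contiguous_segments` is hypothetical, and the handling of the short trailing piece is an assumption; the library's actual behaviour may differ.

```python
from typing import List, Tuple


def contiguous_segments(sequence: str, min_length: int, max_length: int) -> List[Tuple[int, str]]:
    """Split `sequence` into consecutive (start, segment) pairs.

    Illustration only: segments are at most `max_length` long, and any
    segment shorter than `min_length` is dropped (an assumption).
    """
    if max_length < 1:
        raise ValueError("max_length must be a positive integer")
    segments = []
    for start in range(0, len(sequence), max_length):
        piece = sequence[start:start + max_length]
        if len(piece) >= min_length:
            segments.append((start, piece))
    return segments


# The trailing 2-character piece 'AC' is shorter than min_length and is dropped
print(contiguous_segments("ACGTACGTAC", min_length=3, max_length=4))
```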

Computation Parameters:

cpu_cores_for_segmentation (int): Number of CPU cores for segmentation.
cpu_cores_for_tokenization (int): Number of CPU cores for tokenization.
batch_size_tokenization (int): Batch size for tokenization.
batch_size_fasta_segmentation (int): Batch size for fasta file processing.
numpy_token_integer_prec_byte (int): Integer precision, in bytes, used for vectorized token arrays.
np_tokentype (type): Data type for numpy token arrays.

Usage Example:

>>> from prokbert.prokbert_tokenizer import ProkBERTTokenizer
>>> tokenization_parameters = {'kmer': 6, 'shift': 1}
>>> tokenizer = ProkBERTTokenizer(tokenization_params=tokenization_parameters)
>>> encoded = tokenizer('ATTCTTT')
>>> print(encoded)

__init__(tokenization_params: Dict = {}, segmentation_params: Dict = {}, comp_params: Dict = {}, operation_space: str = 'sequence', **kwargs)

Parameters

tokenization_params (Dict) – Dictionary containing tokenization parameters such as k-mer size, shift, max segment length, and more. Defaults to an empty dictionary.
segmentation_params (Dict) – Dictionary containing segmentation parameters like type, min/max length, and coverage. Defaults to an empty dictionary.
comp_params (Dict) – Dictionary containing computation parameters, as described above. Defaults to an empty dictionary.
operation_space (str) – Specifies the operation mode, which can be either ‘kmer’ or ‘sequence’. Defaults to ‘sequence’.

The class supports extended vocabulary and custom unknown tokens for sequence-based operation, and aligns with standard tokenization protocols for language models.

Returns: None

Example:

>>> tokenizer = ProkBERTTokenizer(tokenization_params={'kmer': 6, 'shift': 1}, operation_space='sequence')
>>> tokenizer.tokenize("ACGTACGT")

Methods

`__init__`([tokenization_params, ...])	Initializes the tokenizer from the tokenization, segmentation, and computation parameters.
`add_special_tokens`(special_tokens_dict[, ...])	Add a dictionary of special tokens (eos, pad, cls, etc.) to the encoder and link them to class attributes.
`add_tokens`(new_tokens[, special_tokens])	Add a list of new tokens to the tokenizer class.
`apply_chat_template`(conversation[, tools, ...])	Converts a list of dictionaries with "role" and "content" keys to a list of token ids.
`batch_decode`(token_ids_list, **kwargs)	Decodes multiple token ID sequences back into their original DNA sequences.
`batch_encode_plus`(batch_text_or_text_pairs)	Tokenizes multiple sequences and returns them in a format suitable for model input.
`build_inputs_with_special_tokens`(token_ids_0)	Build model inputs from a sequence or a pair of sequences by adding special tokens.
`clean_up_tokenization`(text)	Clean up tokenization spaces in a given text.
`convert_added_tokens`(obj[, save, add_type_field])
`convert_ids_to_tokens`(ids)	Converts token IDs back to their original tokens.
`convert_to_native_format`(**kwargs)
`convert_tokens_to_ids`(tokens)	Converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary.
`convert_tokens_to_string`(tokens)	Converts a sequence of tokens into a single string.
`create_token_type_ids_from_sequences`(token_ids_0)	Create a mask from the two sequences passed to be used in a sequence-pair classification task.
`decode`(ids)	Decodes a list of token IDs back to the original DNA sequence.
`encode`(segment[, lca_shift, all, ...])	Encode a DNA sequence into its corresponding token IDs using LCA tokenization.
`encode_message_with_chat_template`(message[, ...])	Tokenize a single message. Convenience wrapper around `apply_chat_template` for tokenizing messages one by one, e.g. for token-by-token streaming. The result is best-effort: for some chat templates it may be impossible to robustly tokenize single messages, and `add_generation_prompt` is not supported. Takes `message` (dict with "role" and "content" keys), an optional `conversation_history` (list[dict]) holding the previous messages, and `**kwargs` passed on to `apply_chat_template`; returns a list[int] of token ids.
`encode_plus`(text[, lca_shift, padding_to_max])	Tokenizes a sequence and returns it in a format suitable for model input, including attention masks.
`from_pretrained`(vocab_file)	Loads a pre-trained tokenizer.
`get_added_vocab`()	Returns the added tokens in the vocabulary as a dictionary of token to index.
`get_chat_template`([chat_template, tools])	Retrieve the chat template string used for tokenizing chat messages.
`get_positions_tokens`(sequence, position)	Get tokens containing the nucleotide at the given position.
`get_special_tokens_mask`(token_ids_0[, ...])	Retrieves sequence ids from a token list that has no special tokens added.
`get_vocab`()	Returns the vocabulary dictionary used by the tokenizer.
`num_special_tokens_to_add`([pair])	Returns the number of added tokens when encoding a sequence with special tokens.
`pad`(encoded_inputs[, padding, max_length, ...])	Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length in the batch.
`parse_response`(response[, schema])	Converts an output string created by generating text from a model into a parsed message dictionary.
`prepare_for_model`(ids[, pair_ids, ...])	Prepares a sequence of input ids so it can be used by the model.
`prepare_for_tokenization`(text[, ...])	Performs any necessary transformations before tokenization.
`push_to_hub`(repo_id, *[, commit_message, ...])	Upload the tokenizer files to the 🤗 Model Hub.
`register_for_auto_class`([auto_class])	Register this class with a given auto class.
`save_chat_templates`(save_directory, ...)	Writes chat templates out to the save directory if we're using the new format, and removes them from the tokenizer config if present.
`save_pretrained`(save_directory[, ...])	Save the full tokenizer state.
`save_vocabulary`(save_directory[, ...])	Saves the tokenizer's vocabulary to a file in the specified directory.
`tokenize`(text[, lca_shift, all])	Tokenizes a given DNA segment using the Local Context Aware (LCA) approach.
`truncate_sequences`(ids[, pair_ids, ...])	Truncates sequences according to the specified strategy.
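Because the k-mers overlap, a single nucleotide generally belongs to several tokens, which is the situation `get_positions_tokens` addresses. The sketch below only illustrates the underlying index arithmetic under a plain sliding-window assumption; `windows_covering` is a hypothetical helper, and the actual return format of `get_positions_tokens` is not shown here.

```python
from typing import List


def windows_covering(position: int, sequence_length: int, kmer: int,
                     shift: int, offset: int = 0) -> List[int]:
    """Indices of the k-mer windows that contain the nucleotide at `position`.

    Illustration only: windows start at `offset`, advance by `shift`,
    and must fit completely inside the sequence.
    """
    covering = []
    starts = range(offset, sequence_length - kmer + 1, shift)
    for index, start in enumerate(starts):
        if start <= position < start + kmer:
            covering.append(index)
    return covering


# A 10-nucleotide sequence with kmer=6, shift=1 has windows starting at 0..4
print(windows_covering(position=5, sequence_length=10, kmer=6, shift=1))
print(windows_covering(position=0, sequence_length=10, kmer=6, shift=1))
```

Positions near the ends of a sequence are covered by fewer windows than positions in the middle.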

Attributes

`SPECIAL_TOKENS_ATTRIBUTES`
`added_tokens_decoder`	Returns the added tokens in the vocabulary as a dictionary of index to AddedToken.
`added_tokens_encoder`	Returns the sorted mapping from string to index.
`all_special_ids`	list[int]: List the ids of the special tokens ('<unk>', '<cls>', etc.) mapped to class attributes.
`all_special_tokens`	list[str]: A list of all unique special tokens (named + extra) as strings.
`default_cls_token`
`default_mask_token`
`default_pad_token`
`default_sep_token`
`default_unk_token`
`extended_nucleotide_abc`
`is_fast`
`max_len_sentences_pair`	int: The maximum combined length of a pair of sentences that can be fed to the model.
`max_len_single_sentence`	int: The maximum length of a sentence that can be fed to the model.
`max_model_input_sizes`
`model_input_names`
`nucleotide_abc`
`pad_token_type_id`
`padding_side`
`pretrained_init_configuration`
`pretrained_vocab_files_map`
`sequence_unk_token`
`slow_tokenizer_class`
`special_tokens_map`	dict[str, str]: A flat dictionary mapping named special token attributes to their string values.
`truncation_side`
`vocab_files_names`
`vocab_size`	int: Size of the base vocabulary (without the added tokens).