prokbert.prokbert_tokenizer.ProkBERTTokenizer

class prokbert.prokbert_tokenizer.ProkBERTTokenizer(tokenization_params: Dict = {}, segmentation_params: Dict = {}, comp_params: Dict = {}, operation_space: str = 'sequence', **kwargs)

Custom tokenizer for ProkBERT. It handles the tokenization steps that ProkBERT requires, including Local Context Aware (LCA) tokenization and sequence segmentation. LCA tokenization uses overlapping k-mers to capture local context, which is intended to improve model generalization and performance. The key parameters are the k-mer size and the shift. For instance, a k-mer size of 6 with a shift of 1 produces heavily overlapping tokens that preserve detailed sequence information, while a k-mer size of 1 amounts to basic character-level tokenization.
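The overlapping k-mer windowing described above can be sketched in plain Python. This is a minimal illustration, not ProkBERT's implementation: the helper name overlapping_kmers is hypothetical, windows start only at offset 0, and special tokens, unknown characters, and vocabulary lookup are ignored.

```python
def overlapping_kmers(sequence: str, kmer: int, shift: int) -> list[str]:
    """Split a sequence into k-mers whose start positions are `shift` apart.

    Hypothetical helper for illustration only; it is not part of prokbert.
    """
    if kmer < 1 or shift < 1:
        raise ValueError("kmer and shift must be positive integers")
    # Only windows that fit entirely inside the sequence are kept.
    last_start = len(sequence) - kmer
    return [sequence[i:i + kmer] for i in range(0, last_start + 1, shift)]


# k-mer size 6, shift 1: consecutive tokens overlap in 5 nucleotides.
print(overlapping_kmers("ATTCTTT", kmer=6, shift=1))
# k-mer size 1: every nucleotide becomes its own token.
print(overlapping_kmers("ATTCTTT", kmer=1, shift=1))
```

Under this sketch, a sequence of length L with L >= k yields (L - k) // shift + 1 windows, so a larger shift reduces the token count for the same sequence.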

Parameters
  • tokenization_params (dict) – Parameters for tokenization, derived from the ‘tokenization’ part of the config. Expected keys include ‘type’, ‘kmer’, ‘shift’, etc. See below for detailed descriptions.

  • segmentation_params (dict) – Parameters for segmentation, derived from the ‘segmentation’ part of the config. Includes ‘type’, ‘min_length’, ‘max_length’, etc.

  • comp_params (dict) – Computation parameters from the ‘computation’ part of the config, like CPU cores and batch sizes.

  • operation_space (str) – Defines the operation space (‘sequence’ or ‘kmer’).

Tokenization Parameters:
  • type (str): Tokenization approach, default ‘lca’ for Local Context Aware.

  • kmer (int): k-mer size for tokenization.

  • shift (int): Shift (step) between the start positions of consecutive k-mers.

  • max_segment_length (int): Maximum number of characters in a segment.

  • token_limit (int): Maximum token count for language model processing.

  • max_unknown_token_proportion (float): Maximum allowed proportion of unknown tokens.

  • vocabfile (str): Path to the vocabulary file.

  • isPaddingToMaxLength (bool): Whether to pad sentences to a fixed length.

  • add_special_token (bool): Whether to add special tokens like [CLS], [SEP].

Segmentation Parameters:
  • type (str): Segmentation type, ‘contiguous’ or ‘random’.

  • min_length (int): Minimum length for a segment.

  • max_length (int): Maximum length for a segment.

  • coverage (float): Expected average coverage of positions in the sequence.
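As a rough illustration of the 'contiguous' segmentation type, the sketch below cuts a sequence into consecutive, non-overlapping pieces of at most max_length characters and drops pieces shorter than min_length. The function name and the handling of the final short piece are assumptions made for this example; ProkBERT's own segmentation routine may differ, and the 'random' type with its coverage parameter is not covered here.

```python
def contiguous_segments(sequence: str, min_length: int, max_length: int) -> list[str]:
    """Cut `sequence` into consecutive chunks of at most `max_length` characters.

    Hypothetical helper, not part of prokbert. Chunks shorter than
    `min_length` (typically only the last one) are discarded.
    """
    if max_length < 1 or min_length > max_length:
        raise ValueError("require 1 <= max_length and min_length <= max_length")
    chunks = (sequence[i:i + max_length] for i in range(0, len(sequence), max_length))
    return [chunk for chunk in chunks if len(chunk) >= min_length]


segments = contiguous_segments("ACGTACGTAC", min_length=3, max_length=4)
print(segments)
```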

Computation Parameters:
  • cpu_cores_for_segmentation (int): Number of CPU cores for segmentation.

  • cpu_cores_for_tokenization (int): Number of CPU cores for tokenization.

  • batch_size_tokenization (int): Batch size for tokenization.

  • batch_size_fasta_segmentation (int): Batch size for fasta file processing.

  • numpy_token_integer_prec_byte (int): Integer precision, in bytes, used for vectorization.

  • np_tokentype (type): Data type for numpy token arrays.
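The three parameter groups above map onto the three constructor dictionaries. The sketch below only assembles them: the keys are the documented ones, but every value is an illustrative assumption rather than a documented default, and the build_tokenizer wrapper is a hypothetical convenience that imports prokbert lazily so the dictionaries can be inspected without the package installed.

```python
# Keys are taken from the parameter lists above; values are illustrative
# assumptions, not documented defaults.
tokenization_params = {"type": "lca", "kmer": 6, "shift": 1}
segmentation_params = {"type": "contiguous", "min_length": 10, "max_length": 512}
comp_params = {"cpu_cores_for_tokenization": 2, "batch_size_tokenization": 1000}


def build_tokenizer():
    """Hypothetical wrapper: construct the tokenizer from the dictionaries above."""
    # Imported here so this module loads even when prokbert is not installed.
    from prokbert.prokbert_tokenizer import ProkBERTTokenizer

    return ProkBERTTokenizer(
        tokenization_params=tokenization_params,
        segmentation_params=segmentation_params,
        comp_params=comp_params,
        operation_space="sequence",
    )
```

Keeping each group in its own dictionary mirrors the 'tokenization', 'segmentation', and 'computation' parts of the config that the parameters are derived from.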

Usage Example:
>>> tokenization_parameters = {'kmer': 6, 'shift': 1}
>>> tokenizer = ProkBERTTokenizer(tokenization_params=tokenization_parameters)
>>> encoded = tokenizer('ATTCTTT')
>>> print(encoded)

__init__(tokenization_params: Dict = {}, segmentation_params: Dict = {}, comp_params: Dict = {}, operation_space: str = 'sequence', **kwargs)

Parameters
  • tokenization_params (Dict) – Dictionary containing tokenization parameters such as k-mer size, shift, max segment length, and more. Defaults to an empty dictionary.

  • segmentation_params (Dict) – Dictionary containing segmentation parameters like type, min/max length, and coverage. Defaults to an empty dictionary.

  • comp_params (Dict) – Dictionary containing computation parameters, as described above. Defaults to an empty dictionary.

  • operation_space (str) – Specifies the operation mode, which can be either ‘kmer’ or ‘sequence’. Defaults to ‘sequence’.

The class supports extended vocabulary and custom unknown tokens for sequence-based operation, and aligns with standard tokenization protocols for language models.

Returns

None

Example:
>>> tokenizer = ProkBERTTokenizer(tokenization_params={'kmer': 6, 'shift': 1}, operation_space='sequence')
>>> tokenizer.tokenize("ACGTACGT")

Methods

__init__([tokenization_params, ...])

Initializes the tokenizer from the tokenization, segmentation, and computation parameter dictionaries.

add_special_tokens(special_tokens_dict[, ...])

Add a dictionary of special tokens (eos, pad, cls, etc.) to the encoder and link them to class attributes.

add_tokens(new_tokens[, special_tokens])

Adds a list of new tokens to the tokenizer's vocabulary.

apply_chat_template(conversation[, tools, ...])

Converts a list of dictionaries with "role" and "content" keys to a list of token ids.

batch_decode(token_ids_list, **kwargs)

Decodes multiple token ID sequences back into their original DNA sequences.

batch_encode_plus(batch_text_or_text_pairs)

Tokenizes multiple sequences and returns them in a format suitable for model input.

build_inputs_with_special_tokens(token_ids_0)

Build model inputs from a sequence or a pair of sequences by adding special tokens.

clean_up_tokenization(text)

Clean up tokenization spaces in a given text.

convert_added_tokens(obj[, save, add_type_field])

convert_ids_to_tokens(ids)

Converts token IDs back to their original tokens.

convert_to_native_format(**kwargs)

convert_tokens_to_ids(tokens)

Converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary.

convert_tokens_to_string(tokens)

Converts a sequence of tokens into a single string.

create_token_type_ids_from_sequences(token_ids_0)

Create a mask from the two sequences passed to be used in a sequence-pair classification task.

decode(ids)

Decodes a list of token IDs back to the original DNA sequence.

encode(segment[, lca_shift, all, ...])

Encode a DNA sequence into its corresponding token IDs using LCA tokenization.

encode_message_with_chat_template(message[, ...])

Tokenize a single message. This method is a convenience wrapper around apply_chat_template that allows you to tokenize messages one by one. This is useful for things like token-by-token streaming.

This method is not guaranteed to be perfect. For some models, it may be impossible to robustly tokenize single messages. For example, if the chat template adds tokens after each message, but also has a prefix that is added to the entire chat, it will be impossible to distinguish a chat-start-token from a message-start-token. In these cases, this method will do its best to find the correct tokenization, but it may not be perfect.

Note: This method does not support add_generation_prompt. If you want to add a generation prompt, you should do it separately after tokenizing the conversation.

Parameters
  • message (dict) – A dictionary with "role" and "content" keys, representing the message to tokenize.

  • conversation_history (list[dict], optional) – A list of dicts with "role" and "content" keys, representing the chat history so far. If you are tokenizing messages one by one, you should pass the previous messages in the conversation here.

  • **kwargs – Additional kwargs to pass to the apply_chat_template method.

Returns

list[int]: A list of token ids representing the tokenized message.

encode_plus(text[, lca_shift, padding_to_max])

Tokenizes a sequence and returns it in a format suitable for model input, including attention masks.

from_pretrained(vocab_file)

Loads a pre-trained tokenizer.

get_added_vocab()

Returns the added tokens in the vocabulary as a dictionary of token to index.

get_chat_template([chat_template, tools])

Retrieve the chat template string used for tokenizing chat messages.

get_positions_tokens(sequence, position)

Get tokens containing the nucleotide at the given position.

get_special_tokens_mask(token_ids_0[, ...])

Retrieves sequence ids from a token list that has no special tokens added.

get_vocab()

Returns the vocabulary dictionary used by the tokenizer.

num_special_tokens_to_add([pair])

Returns the number of added tokens when encoding a sequence with special tokens.

pad(encoded_inputs[, padding, max_length, ...])

Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length in the batch.

parse_response(response[, schema])

Converts an output string created by generating text from a model into a parsed message dictionary.

prepare_for_model(ids[, pair_ids, ...])

Prepares a sequence of input ids so it can be used by the model.

prepare_for_tokenization(text[, ...])

Performs any necessary transformations before tokenization.

push_to_hub(repo_id, *[, commit_message, ...])

Upload the tokenizer files to the 🤗 Model Hub.

register_for_auto_class([auto_class])

Register this class with a given auto class.

save_chat_templates(save_directory, ...)

Writes chat templates out to the save directory if we're using the new format, and removes them from the tokenizer config if present.

save_pretrained(save_directory[, ...])

Save the full tokenizer state.

save_vocabulary(save_directory[, ...])

Saves the tokenizer's vocabulary to a file in the specified directory.

tokenize(text[, lca_shift, all])

Tokenizes a given DNA segment using the Local Context Aware (LCA) approach.

truncate_sequences(ids[, pair_ids, ...])

Truncates sequences according to the specified strategy.
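
To make the idea behind get_positions_tokens (listed above) concrete, the following sketch computes which k-mer windows contain a given nucleotide position. It is an independent illustration under simple assumptions (0-based positions, windows starting at offset 0, no special tokens); the function name is hypothetical, and the real method's return format is not documented here.

```python
def kmers_covering_position(
    sequence: str, position: int, kmer: int, shift: int = 1
) -> list[tuple[int, str]]:
    """Return (start, k-mer) pairs for every window that contains `position`.

    Hypothetical helper, not part of prokbert; positions are 0-based.
    """
    if not 0 <= position < len(sequence):
        raise IndexError("position is outside the sequence")
    covering = []
    for start in range(0, len(sequence) - kmer + 1, shift):
        # A window starting at `start` spans [start, start + kmer).
        if start <= position < start + kmer:
            covering.append((start, sequence[start:start + kmer]))
    return covering


print(kmers_covering_position("ACGTACGT", position=2, kmer=6))
```

With a shift of 1, an interior nucleotide can appear in up to k overlapping tokens, which is the local context that the LCA approach relies on.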

Attributes

SPECIAL_TOKENS_ATTRIBUTES

added_tokens_decoder

Returns the added tokens in the vocabulary as a dictionary of index to AddedToken.

added_tokens_encoder

Returns the sorted mapping from string to index.

all_special_ids

list[int]: List the ids of the special tokens ('<unk>', '<cls>', etc.) mapped to class attributes.

all_special_tokens

list[str]: A list of all unique special tokens (named + extra) as strings.

default_cls_token

default_mask_token

default_pad_token

default_sep_token

default_unk_token

extended_nucleotide_abc

is_fast

max_len_sentences_pair

int: The maximum combined length of a pair of sentences that can be fed to the model.

max_len_single_sentence

int: The maximum length of a sentence that can be fed to the model.

max_model_input_sizes

model_input_names

nucleotide_abc

pad_token_type_id

padding_side

pretrained_init_configuration

pretrained_vocab_files_map

sequence_unk_token

slow_tokenizer_class

special_tokens_map

dict[str, str]: A flat dictionary mapping named special token attributes to their string values.

truncation_side

vocab_files_names

vocab_size

int: Size of the base vocabulary (without the added tokens).