prokbert.prokbert_tokenizer.ProkBERTTokenizer.tokenize

ProkBERTTokenizer.tokenize(text: str, lca_shift: int = 0, all: bool = False) → Union[List[str], Tuple[List[List[str]], List[List[str]]]]

Tokenizes a given DNA segment using the Local Context Aware (LCA) approach. Depending on the parameters, it can return either a single tokenized sequence or all possible tokenizations.

Parameters

text (str) – The DNA segment to be tokenized. Represents a sequence of nucleotides.
lca_shift (int, optional) – The LCA offset that selects which tokenized vector to return. A value of 0 returns the first vector; higher values return subsequent vectors. Defaults to 0.
all (bool, optional) – If True, the method returns all possible tokenizations for the given segment. If False, it returns only the tokenized vector corresponding to the specified LCA offset. Defaults to False.

Returns

If ‘all’ is False, returns the list of tokens for the specified LCA shift. If ‘all’ is True, returns a tuple of two lists: the first holds every possible tokenization of the segment (one token list per LCA offset), and the second holds the corresponding k-merized segments.

Return type

Union[List[str], Tuple[List[List[str]], List[List[str]]]]

Usage Example:

>>> from prokbert.prokbert_tokenizer import ProkBERTTokenizer
>>> tokenizer = ProkBERTTokenizer(tokenization_params={'kmer': 6, 'shift': 1}, operation_space='sequence')
>>> segment = 'AATCAAGGAATTATTATCGTT'
>>> tokens, kmers = tokenizer.tokenize(segment, all=True)
>>> print(tokens)
...
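
To make the relationship between lca_shift and all more concrete, the following is a minimal, library-free sketch of offset-based k-mer windowing. It is not ProkBERT's implementation: the helper name kmerize_with_offsets is hypothetical, and the assumption that there is one tokenization per offset in range(shift) is an illustration rather than something stated above. It also ignores any special tokens or vocabulary mapping the real tokenizer may apply.

```python
from typing import List


def kmerize_with_offsets(segment: str, kmer: int, shift: int) -> List[List[str]]:
    """Hypothetical helper: build one k-mer list per offset (0 .. shift - 1).

    Assumption for illustration only: each offset yields its own
    tokenization, and windows advance by `shift` nucleotides.
    """
    windows_per_offset = []
    last_start = len(segment) - kmer  # last index where a full k-mer fits
    for offset in range(shift):
        # Slide a window of length `kmer`, starting at `offset`.
        windows = [segment[i:i + kmer] for i in range(offset, last_start + 1, shift)]
        windows_per_offset.append(windows)
    return windows_per_offset


segment = "AATCAAGGAATTATTATCGTT"
all_kmers = kmerize_with_offsets(segment, kmer=6, shift=2)

# Picking a single offset is analogous to tokenize(segment, lca_shift=1);
# keeping the full list is analogous to tokenize(segment, all=True).
lca_shift = 1
selected = all_kmers[lca_shift]

print(len(all_kmers))  # 2
print(selected[:3])    # ['ATCAAG', 'CAAGGA', 'AGGAAT']
```

Under these assumptions, a shift of 2 gives two candidate tokenizations of the same segment, and lca_shift simply indexes into that list.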