prokbert.prokbert_tokenizer.ProkBERTTokenizer.tokenize
- ProkBERTTokenizer.tokenize(text: str, lca_shift: int = 0, all: bool = False) → Union[List[str], Tuple[List[List[str]], List[List[str]]]]
Tokenizes a given DNA segment using the Local Context Aware (LCA) approach. Depending on the parameters, it can return either a single tokenized sequence or all possible tokenizations.
- Parameters
text (str) – The DNA segment to be tokenized. Represents a sequence of nucleotides.
lca_shift (int, optional) – Specifies the LCA offset for tokenization. Determines which specific tokenized vector to return. A value of 0 returns the first vector, while higher values return subsequent vectors. Defaults to 0.
all (bool, optional) – If set to True, the method returns all possible tokenizations for the given segment. When False, it returns only the tokenized vector corresponding to the specified LCA offset. Defaults to False.
- Returns
If ‘all’ is False, returns a list of tokenized segments corresponding to the specified LCA shift. If ‘all’ is True, returns a tuple containing two lists: one with all possible tokenized segments, and the other with k-merized segments.
- Return type
Union[List[str], Tuple[List[List[str]], List[List[str]]]]
- Usage Example:
>>> tokenizer = ProkBERTTokenizer(tokenization_params={'kmer': 6, 'shift': 1}, operation_space='sequence')
>>> segment = 'AATCAAGGAATTATTATCGTT'
>>> tokens, kmers = tokenizer.tokenize(segment, all=True)
>>> print(tokens)
...
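The relationship between the kmer size, the shift, and the LCA offset can be illustrated with a small standalone sketch. This is not ProkBERT's implementation: the helpers `kmerize` and `kmerize_all` are hypothetical names introduced here, special tokens and vocabulary lookup are not modeled, and the assumption that there is one tokenization per offset in `0..shift-1` is an interpretation of the description above rather than something this page states.

```python
from typing import List


def kmerize(segment: str, kmer: int, shift: int, offset: int = 0) -> List[str]:
    """Hypothetical helper: k-mers of `segment`, starting at `offset`
    and advancing `shift` nucleotides per step."""
    last_start = len(segment) - kmer  # last index where a full k-mer still fits
    return [segment[i:i + kmer] for i in range(offset, last_start + 1, shift)]


def kmerize_all(segment: str, kmer: int, shift: int) -> List[List[str]]:
    """Hypothetical helper: one k-mer list per offset 0..shift-1
    (assumed to mirror what all=True enumerates)."""
    return [kmerize(segment, kmer, shift, offset) for offset in range(shift)]


segment = "AATCAAGGAATTATTATCGTT"

# kmer=6, shift=1: consecutive windows overlap by five nucleotides.
print(kmerize(segment, kmer=6, shift=1)[:3])  # ['AATCAA', 'ATCAAG', 'TCAAGG']

# kmer=6, shift=2: two distinct offsets, hence two k-mer lists.
for offset, kmers in enumerate(kmerize_all(segment, kmer=6, shift=2)):
    print(offset, kmers)
```

With `shift=1` there is only one offset under this assumption, so `lca_shift=0` would be the only meaningful value; larger shifts yield several alternative windowings of the same segment.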