prokbert.prokbert_tokenizer.ProkBERTTokenizer.encode

ProkBERTTokenizer.encode(segment: str, lca_shift: int = 0, all: bool = False, add_special_tokens: bool = True, **kwargs) → List[int]

Encode a DNA sequence into its corresponding token IDs using LCA tokenization.

Parameters
  • segment (str) – The DNA segment to encode.

  • lca_shift (int, optional) – The LCA offset at which tokenization starts. An error is raised if the requested offset is greater than or equal to the tokenizer's shift. Defaults to 0.

  • all (bool, optional) – If True, returns all possible tokenizations. Defaults to False.

  • add_special_tokens (bool, optional) – Whether to add special tokens like [CLS] and [SEP]. Defaults to True.

Returns

A list of encoded token IDs. If all is True, a list of lists is returned instead, one list per possible tokenization.

Return type

List[int] (List[List[int]] when all is True)

Usage Example:
>>> from prokbert.prokbert_tokenizer import ProkBERTTokenizer
>>> tokenizer = ProkBERTTokenizer()
>>> segment = 'AATCAAGGAATTATTATCGTT'
>>> ids = tokenizer.encode(segment)
>>> print(ids)
...
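
To make the role of the offset and of all more concrete, the following is a minimal, self-contained sketch of offset-based k-mer windowing. It is not the ProkBERT implementation: the function names toy_lca_windows and toy_all_windows, the k-mer size of 6, and the shift of 2 are assumptions chosen for illustration, and the sketch returns k-mer strings rather than token IDs because the real vocabulary mapping is not reproduced here.

```python
from typing import List


def toy_lca_windows(segment: str, k: int = 6, shift: int = 2, offset: int = 0) -> List[str]:
    """Split `segment` into k-mers, starting at `offset` and stepping by `shift`.

    Hypothetical helper: `k`, `shift`, and `offset` are illustrative
    parameters, not the library's actual signature.
    """
    # Mirror the documented rule: an offset >= shift is rejected.
    if not 0 <= offset < shift:
        raise ValueError(f"offset must satisfy 0 <= offset < {shift}, got {offset}")
    # Only full-length windows are kept; a trailing partial k-mer is dropped.
    return [segment[i:i + k] for i in range(offset, len(segment) - k + 1, shift)]


def toy_all_windows(segment: str, k: int = 6, shift: int = 2) -> List[List[str]]:
    """Return one windowing per valid offset, analogous to all=True."""
    return [toy_lca_windows(segment, k, shift, offset) for offset in range(shift)]


segment = "AATCAAGGAATTATTATCGTT"
print(toy_lca_windows(segment, offset=0))
print(toy_lca_windows(segment, offset=1))
print(len(toy_all_windows(segment)))
```

With a shift of 2 there are two valid offsets (0 and 1), so the all-offsets variant yields two lists; with a shift of 1 only offset 0 is valid and there is a single tokenization.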