prokbert.prokbert_tokenizer.ProkBERTTokenizer.encode
- ProkBERTTokenizer.encode(segment: str, lca_shift: int = 0, all: bool = False, add_special_tokens: bool = True, **kwargs) → List[int]
Encode a DNA sequence into its corresponding token IDs using LCA tokenization.
- Parameters
segment (str) – The DNA segment to encode.
lca_shift (int, optional) – The LCA offset at which tokenization starts. An error is raised if the requested offset is greater than or equal to the tokenizer's shift. Defaults to 0.
all (bool, optional) – If True, returns all possible tokenizations. Defaults to False.
add_special_tokens (bool, optional) – Whether to add special tokens like [CLS] and [SEP]. Defaults to True.
- Returns
A list of encoded token IDs. If 'all' is True, a list of lists is returned instead, one per possible tokenization.
- Return type
List[int] (List[List[int]] when 'all' is True)
- Usage Example:
>>> from prokbert.prokbert_tokenizer import ProkBERTTokenizer
>>> tokenizer = ProkBERTTokenizer()
>>> segment = 'AATCAAGGAATTATTATCGTT'
>>> ids = tokenizer.encode(segment)
>>> print(ids)
...
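To illustrate what an LCA offset means and why it must be smaller than the shift, here is a minimal pure-Python sketch of overlapping k-mer windowing. It is not ProkBERT's implementation: the helper names `lca_kmers` and `all_lca_kmers`, and the default values `kmer=6` and `shift=1`, are assumptions chosen for illustration, and the mapping from k-mers to token IDs and special tokens is left out.

```python
from typing import List


def lca_kmers(segment: str, kmer: int = 6, shift: int = 1, offset: int = 0) -> List[str]:
    """Split a DNA segment into overlapping k-mers.

    Hypothetical helper for illustration only; not part of ProkBERT.
    Windows start at `offset` and advance by `shift` characters.
    """
    # Mirrors the documented rule: an offset >= shift is an error.
    if not 0 <= offset < shift:
        raise ValueError(f"offset must be in [0, {shift}), got {offset}")
    last_start = len(segment) - kmer
    return [segment[i:i + kmer] for i in range(offset, last_start + 1, shift)]


def all_lca_kmers(segment: str, kmer: int = 6, shift: int = 1) -> List[List[str]]:
    """Return one tokenization per valid offset (analogous to all=True)."""
    return [lca_kmers(segment, kmer, shift, offset) for offset in range(shift)]


if __name__ == "__main__":
    seg = "AATCAAGG"
    print(lca_kmers(seg, kmer=6, shift=2, offset=0))
    print(lca_kmers(seg, kmer=6, shift=2, offset=1))
    print(all_lca_kmers(seg, kmer=6, shift=2))
```

With a shift of 2 there are exactly two distinct window alignments (offsets 0 and 1), which is why an offset of 2 or more would only repeat an existing alignment and is rejected in this sketch.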