prokbert.prokbert_tokenizer.ProkBERTTokenizer.encode_plus
- ProkBERTTokenizer.encode_plus(text: str, lca_shift: int = 0, padding_to_max: bool = False, **kwargs) → Dict[str, Union[Tensor, ndarray]]
Tokenizes a sequence and returns it in a format suitable for model input, including attention masks.
- Parameters
text (str) – The DNA sequence to tokenize.
lca_shift (int, optional) – LCA offset applied during tokenization. Defaults to 0.
padding_to_max (bool, optional) – If True, pads the tokenized sequence to the maximum length. Defaults to False.
return_tensors (str, optional) – Selects the return type, either NumPy arrays or PyTorch tensors. It is not a named parameter in the signature above, so it is supplied as a keyword argument through **kwargs.
- Returns
A dictionary containing ‘input_ids’ and ‘attention_mask’, returned as either NumPy arrays or PyTorch tensors depending on the ‘return_tensors’ parameter.
- Return type
Dict[str, Union[torch.Tensor, np.ndarray]]
- Usage Example:
>>> tokenizer = ProkBERTTokenizer()
>>> segment = 'AATCAAGGAATTATTATCGTT'
>>> encoded = tokenizer.encode_plus(segment)
>>> print(encoded['input_ids'])
...
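
The relationship between ‘input_ids’, ‘attention_mask’ and padding_to_max can be illustrated with a small stand-alone sketch. This is not ProkBERT's implementation: build_model_input, PAD_ID, the example token ids, max_length and the truncation step are all hypothetical and chosen only for illustration. It shows the common convention in which real tokens get a mask value of 1 and padding positions get 0.

```python
from typing import Dict, List

# Assumed padding id, for illustration only; the real tokenizer's
# vocabulary may use a different value.
PAD_ID = 0


def build_model_input(
    token_ids: List[int],
    max_length: int,
    padding_to_max: bool = False,
) -> Dict[str, List[int]]:
    """Hypothetical helper mimicking the shape of an encode_plus result."""
    # Truncating to max_length is an assumption of this sketch.
    ids = list(token_ids[:max_length])
    # Every real token is attended to, so its mask value is 1.
    mask = [1] * len(ids)
    if padding_to_max:
        n_pad = max_length - len(ids)
        # Padding positions are filled with PAD_ID and masked out with 0.
        ids += [PAD_ID] * n_pad
        mask += [0] * n_pad
    return {"input_ids": ids, "attention_mask": mask}


example = build_model_input([2, 17, 42, 3], max_length=6, padding_to_max=True)
print(example["input_ids"])       # [2, 17, 42, 3, 0, 0]
print(example["attention_mask"])  # [1, 1, 1, 1, 0, 0]
```

With padding_to_max left at False, the sketch returns both lists at the length of the tokenized sequence, with no padding entries.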