prokbert.prokbert_tokenizer.ProkBERTTokenizer.encode_plus

ProkBERTTokenizer.encode_plus(text: str, lca_shift: int = 0, padding_to_max: bool = False, **kwargs) → Dict[str, Union[Tensor, ndarray]]

Tokenizes a DNA sequence and returns the token IDs together with the matching attention mask, in a format suitable for model input.

Parameters

text (str) – The DNA sequence to tokenize.
lca_shift (int, optional) – LCA offset applied during tokenization. Defaults to 0.
padding_to_max (bool, optional) – If True, pads the tokenized sequence to the maximum length. Defaults to False.
return_tensors (str, optional) – Not a named parameter in the signature, so it is passed through **kwargs. Selects the return type (NumPy ndarray or PyTorch tensor).

Returns

A dictionary containing ‘input_ids’ and ‘attention_mask’, each returned as a NumPy array or a PyTorch tensor depending on the ‘return_tensors’ argument.

Return type

Dict[str, Union[torch.Tensor, np.ndarray]]

Usage Example:

>>> from prokbert.prokbert_tokenizer import ProkBERTTokenizer
>>> tokenizer = ProkBERTTokenizer()
>>> segment = 'AATCAAGGAATTATTATCGTT'
>>> encoded = tokenizer.encode_plus(segment)
>>> print(encoded['input_ids'])
...
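
To make the shape of the returned dictionary concrete, the following is a minimal, library-free sketch of how an 'input_ids' / 'attention_mask' pair could be built from overlapping k-mers. It is not ProkBERT's implementation. The function names sketch_encode_plus and kmer_to_id, the k-mer size of 6, the stride of 2, the maximum length of 16, the special-token ids, and the base-4 id scheme are all hypothetical choices made for illustration. In particular, treating lca_shift as the start position of the first k-mer is an assumption.

```python
from typing import Dict, List

# Hypothetical special-token ids; ProkBERT's real vocabulary may differ.
PAD_ID, CLS_ID, SEP_ID, UNK_ID = 0, 1, 2, 3
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_id(kmer: str) -> int:
    """Map a k-mer to an illustrative id: its base-4 value, offset past the special ids."""
    value = 0
    for base in kmer:
        if base not in BASES:
            return UNK_ID  # any non-ACGT character makes the whole k-mer unknown
        value = value * 4 + BASES[base]
    return value + 4


def sketch_encode_plus(
    text: str,
    kmer: int = 6,          # assumed k-mer size
    shift: int = 2,         # assumed stride between consecutive k-mers
    lca_shift: int = 0,     # assumed here to be the start offset of the first k-mer
    padding_to_max: bool = False,
    max_len: int = 16,      # assumed maximum token length
) -> Dict[str, List[int]]:
    # Start positions of each k-mer window, beginning at the offset.
    starts = range(lca_shift, len(text) - kmer + 1, shift)
    ids = [CLS_ID] + [kmer_to_id(text[i:i + kmer]) for i in starts] + [SEP_ID]
    mask = [1] * len(ids)  # 1 marks a real token

    if padding_to_max and len(ids) < max_len:
        pad = max_len - len(ids)
        ids += [PAD_ID] * pad
        mask += [0] * pad  # 0 marks padding the model should ignore

    return {"input_ids": ids, "attention_mask": mask}


encoded = sketch_encode_plus("AATCAAGGAATTATTATCGTT", padding_to_max=True)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

In this sketch the attention mask is 1 for every real token (including the start and end markers) and 0 for every padding position, so both lists always have the same length. Changing lca_shift moves the k-mer windows, which yields a different token sequence for the same input.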