prokbert.prokbert_tokenizer.ProkBERTTokenizer.batch_encode_plus

ProkBERTTokenizer.batch_encode_plus(batch_text_or_text_pairs: List[str], lca_shift: int = 0, all: bool = False, **kwargs) → Dict[str, List[List[int]]]

Tokenizes a batch of sequences and returns the encodings in a format suitable for model input. The sequences are assumed to be already preprocessed (segmented) and quality-controlled (i.e. all uppercase).
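Because the method does not clean its input, a small pre-check can catch problems before tokenization. The sketch below is not part of ProkBERT: check_segments is a hypothetical helper, and the allowed alphabet (A, C, G, T) is an assumption that may need widening for your data.

```python
from typing import List

# Assumption: segments use only the plain DNA alphabet, in uppercase.
ALLOWED_BASES = frozenset("ACGT")


def check_segments(segments: List[str]) -> List[str]:
    """Hypothetical helper: return the segments unchanged if all pass, else raise ValueError."""
    for i, seg in enumerate(segments):
        if not seg:
            raise ValueError(f"segment {i} is empty")
        # Lowercase letters and ambiguity codes such as 'N' land here.
        bad = set(seg) - ALLOWED_BASES
        if bad:
            raise ValueError(f"segment {i} contains unexpected characters: {sorted(bad)}")
    return segments


segments = check_segments(['AATCAAGGA', 'ATTATTATCGTT'])
```

The validated list can then be passed to batch_encode_plus as usual.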

Parameters

batch_text_or_text_pairs (List[str]) – A list of DNA sequences to be tokenized.
lca_shift (int, optional) – The LCA offset to use for tokenization. An error is raised if the requested offset is >= the tokenizer's shift. Defaults to 0.
all (bool, optional) – If True, returns all possible tokenization vectors. If False, only the tokenization corresponding to the specified offset is used. Defaults to False.
return_tensors (str, optional) – Passed through **kwargs. Selects the return type (NumPy ndarray or PyTorch tensor).
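To see why the offset has to stay below the shift, it can help to look at overlapping k-mer windows directly. The following is a minimal sketch of the idea, not the ProkBERT implementation: kmer_windows is a hypothetical function, the values kmer=6 and shift=2 are assumptions chosen for illustration, and the real tokenizer returns integer ids rather than k-mer strings.

```python
from typing import List


def kmer_windows(seq: str, kmer: int, shift: int, offset: int = 0) -> List[str]:
    """Hypothetical sketch: k-mers of length `kmer`, taken every `shift` bases, starting at `offset`."""
    # An offset of `shift` or more would only repeat an earlier window grid,
    # so it is rejected, mirroring the documented error condition.
    if not 0 <= offset < shift:
        raise ValueError(f"offset must satisfy 0 <= offset < shift, got {offset} with shift {shift}")
    return [seq[i:i + kmer] for i in range(offset, len(seq) - kmer + 1, shift)]


seq = 'ATTATTATCGTT'
# With shift=2 there are exactly two distinct window grids: offset 0 and offset 1.
for offset in range(2):
    print(offset, kmer_windows(seq, kmer=6, shift=2, offset=offset))
```

In this picture, lca_shift selects one of the window grids, while all=True corresponds to producing every grid.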

Returns

A dictionary with the keys 'input_ids', 'token_type_ids', and 'attention_mask', each holding a list of lists. If return_tensors is set to 'pt', the values are returned as PyTorch tensors instead.

Return type

Dict[str, List[List[int]]]

Usage Example:

>>> from prokbert.prokbert_tokenizer import ProkBERTTokenizer
>>> tokenizer = ProkBERTTokenizer()
>>> segments = ['AATCAAGGA', 'ATTATTATCGTT']
>>> encoded = tokenizer.batch_encode_plus(segments)
>>> print(encoded['input_ids'])
...