prokbert.prokbert_tokenizer.ProkBERTTokenizer.batch_encode_plus
- ProkBERTTokenizer.batch_encode_plus(batch_text_or_text_pairs: List[str], lca_shift: int = 0, all: bool = False, **kwargs) → Dict[str, List[List[int]]]
Tokenizes multiple sequences and returns them in a format suitable for model input. Assumes that sequences have already been preprocessed (segmented) and quality controlled (i.e. all uppercase).
- Parameters
batch_text_or_text_pairs (List[str]) – A list of DNA sequences to be tokenized.
lca_shift (int, optional) – The LCA offset to use for tokenization. An error is raised if the requested offset is greater than or equal to the tokenizer's shift. Defaults to 0.
all (bool, optional) – If True, returns all possible tokenization vectors. If False, only the tokenization corresponding to the specified offset is used. Defaults to False.
return_tensors (str, optional) – Passed through **kwargs. Selects the return type (NumPy ndarray or PyTorch tensor), e.g. 'pt' for PyTorch tensors.
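The interaction between lca_shift and all described above can be pictured with a toy k-mer windowing sketch. This is not ProkBERT's implementation: the helper names `kmer_windows` and `toy_tokenize`, the values k=6 and shift=2, and the choice to return k-mer strings instead of vocabulary ids are all assumptions made for illustration, and special tokens are ignored.

```python
from typing import List


def kmer_windows(seq: str, k: int, shift: int, offset: int) -> List[str]:
    """Return the k-mers of `seq` starting at offset, offset + shift, ..."""
    if not 0 <= offset < shift:
        # Mirrors the documented rule: an offset >= shift is an error.
        raise ValueError(f"offset {offset} must be in the range [0, {shift})")
    return [seq[i:i + k] for i in range(offset, len(seq) - k + 1, shift)]


def toy_tokenize(seq: str, k: int = 6, shift: int = 2,
                 lca_shift: int = 0, return_all: bool = False) -> List[List[str]]:
    """Hypothetical stand-in for the offset logic (k and shift are assumed values).

    `return_all` plays the role of the documented `all` flag; it is renamed
    here only to avoid shadowing Python's built-in `all`.
    """
    if return_all:
        # One tokenization per possible offset: 0, 1, ..., shift - 1.
        return [kmer_windows(seq, k, shift, o) for o in range(shift)]
    # Only the tokenization for the requested offset.
    return [kmer_windows(seq, k, shift, lca_shift)]


print(toy_tokenize('AATCAAGGA'))
# [['AATCAA', 'TCAAGG']]
print(toy_tokenize('AATCAAGGA', return_all=True))
# [['AATCAA', 'TCAAGG'], ['ATCAAG', 'CAAGGA']]
```

With shift=2 there are two possible offsets, so return_all=True yields two token lists for the same segment, while the default yields only the one for offset 0.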
- Returns
A dictionary containing ‘input_ids’, ‘token_type_ids’, and ‘attention_mask’ as lists of lists. If ‘return_tensors’ is specified as ‘pt’, these are returned as PyTorch tensors.
- Return type
Dict[str, List[List[int]]]
- Usage Example:
>>> from prokbert.prokbert_tokenizer import ProkBERTTokenizer
>>> tokenizer = ProkBERTTokenizer()
>>> segments = ['AATCAAGGA', 'ATTATTATCGTT']
>>> encoded = tokenizer.batch_encode_plus(segments)
>>> print(encoded['input_ids'])
...