prokbert.sequtils.tokenize_kmerized_segment_list
- prokbert.sequtils.tokenize_kmerized_segment_list(kmerized_segments: List[List[str]], vocabmap: Dict[str, int], token_limit: int, max_unknown_token_proportion: float, add_special_tokens: bool = True) → List[List[int]]
Tokenizes (vectorizes) a list of k-merized segments into a list of token vectors. If the expected number of tokens in a segment exceeds token_limit, the function raises a ValueError. For segments in which the proportion of unknown k-mers exceeds max_unknown_token_proportion, the output is a special token sequence indicating an empty sentence.
- Parameters
kmerized_segments (List[List[str]]) – List containing k-merized segments.
vocabmap (Dict[str, int]) – Dictionary that maps k-mers to their respective token values.
token_limit (int) – Maximum number of tokens allowed in the tokenized output.
max_unknown_token_proportion (float) – Maximum allowable proportion of unknown tokens in a segment.
add_special_tokens (bool, optional (default=True)) – Whether to add special tokens ([CLS] and [SEP]) to the tokenized segments.
- Returns
List containing tokenized segments.
- Return type
List[List[int]]
- Raises
ValueError – If the expected number of tokens in a segment exceeds token_limit.
>>> vocabmap_example = {"[CLS]": 2, "[SEP]": 3, "[UNK]": 0, "TCTTTG": 4, "CTTTGC": 5, "TTTGCT": 6, "TTGCTA": 7}
>>> kmerized_segment_example = [['TCTTTG', 'CTTTGC', 'TTTGCT', 'TTGCTA']]
>>> tokenize_kmerized_segment_list(kmerized_segment_example, vocabmap_example, 10, 0.2)
[[2, 4, 5, 6, 7, 3]]
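The behaviour described above (the token limit check, the unknown k-mer threshold, and the optional special tokens) can be illustrated with a small stand-alone sketch. This is not ProkBERT's implementation: `sketch_tokenize` is a hypothetical name, and the sketch assumes that the "empty sentence" output is simply `[CLS]` followed by `[SEP]`, that the expected token count is one token per k-mer plus the special tokens, and that the unknown proportion is measured against the number of k-mers in the segment.

```python
from typing import Dict, List


def sketch_tokenize(
    kmerized_segments: List[List[str]],
    vocabmap: Dict[str, int],
    token_limit: int,
    max_unknown_token_proportion: float,
    add_special_tokens: bool = True,
) -> List[List[int]]:
    """Illustrative sketch of the documented behaviour (hypothetical, not library code)."""
    cls_id, sep_id, unk_id = vocabmap["[CLS]"], vocabmap["[SEP]"], vocabmap["[UNK]"]
    # Assumption: an "empty sentence" is encoded as [CLS] followed by [SEP].
    empty_sentence = [cls_id, sep_id]
    n_special = 2 if add_special_tokens else 0

    tokenized: List[List[int]] = []
    for segment in kmerized_segments:
        # Assumption: expected length = one token per k-mer plus special tokens.
        expected = len(segment) + n_special
        if expected > token_limit:
            raise ValueError(
                f"Expected {expected} tokens, but token_limit is {token_limit}."
            )

        # Assumption: unknown proportion is relative to the k-mer count.
        n_unknown = sum(1 for kmer in segment if kmer not in vocabmap)
        if segment and n_unknown / len(segment) > max_unknown_token_proportion:
            tokenized.append(list(empty_sentence))
            continue

        # Map each k-mer to its ID, falling back to [UNK] for unseen k-mers.
        ids = [vocabmap.get(kmer, unk_id) for kmer in segment]
        if add_special_tokens:
            ids = [cls_id] + ids + [sep_id]
        tokenized.append(ids)
    return tokenized


vocab = {"[CLS]": 2, "[SEP]": 3, "[UNK]": 0,
         "TCTTTG": 4, "CTTTGC": 5, "TTTGCT": 6, "TTGCTA": 7}

# Half of the k-mers are unknown (0.5 > 0.2), so the segment collapses
# to the assumed empty-sentence sequence.
print(sketch_tokenize([["TCTTTG", "NNNNNN"]], vocab, 10, 0.2))  # [[2, 3]]
```

Under these assumptions, a segment of four k-mers with `token_limit=3` would raise a `ValueError`, since the expected length with special tokens is six. The actual function may count tokens or encode empty sentences differently, so the library source remains the reference.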