prokbert.sequtils.tokenize_kmerized_segment_list

prokbert.sequtils.tokenize_kmerized_segment_list(kmerized_segments: List[List[str]], vocabmap: Dict[str, int], token_limit: int, max_unknown_token_proportion: float, add_special_tokens: bool = True) → List[List[int]]

Tokenizes (vectorizes) a list of k-merized segments into lists of token IDs. If the expected number of tokens in a segment exceeds token_limit, the function raises a ValueError. If the proportion of unknown k-mers in a segment exceeds max_unknown_token_proportion, the output for that segment is a special token sequence representing an empty sentence.

Parameters

kmerized_segments (List[List[str]]) – List containing k-merized segments.
vocabmap (Dict[str, int]) – Dictionary that maps k-mers to their respective token values.
token_limit (int) – Maximum number of tokens allowed in a tokenized segment.
max_unknown_token_proportion (float) – Maximum allowable proportion of unknown tokens in a segment.
add_special_tokens (bool, optional, default=True) – Whether to add the special tokens [CLS] and [SEP] to the tokenized segments.

Returns

List of tokenized segments, each a list of integer token IDs.

Return type

List[List[int]]

Raises

ValueError – If the expected number of tokens in a segment exceeds token_limit.

Examples

>>> from prokbert.sequtils import tokenize_kmerized_segment_list
>>> vocabmap_example = {"[CLS]": 2, "[SEP]": 3, "[UNK]": 0, "TCTTTG": 4, "CTTTGC": 5, "TTTGCT": 6, "TTGCTA": 7}
>>> kmerized_segment_example = [['TCTTTG', 'CTTTGC', 'TTTGCT', 'TTGCTA']]
>>> tokenize_kmerized_segment_list(kmerized_segment_example, vocabmap_example, 10, 0.2)
[[2, 4, 5, 6, 7, 3]]
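
The behaviour described above can be approximated with a short standalone sketch. This is not ProkBERT's implementation: the function name tokenize_segments_sketch is hypothetical, and several details are assumptions made only for illustration, namely that the expected token count is the number of k-mers plus the two special tokens, that a k-mer is "unknown" when it is absent from vocabmap, and that the "empty sentence" is the pair [CLS], [SEP].

```python
from typing import Dict, List


def tokenize_segments_sketch(
    kmerized_segments: List[List[str]],
    vocabmap: Dict[str, int],
    token_limit: int,
    max_unknown_token_proportion: float,
    add_special_tokens: bool = True,
) -> List[List[int]]:
    """Illustrative re-creation of the documented behaviour (hypothetical helper)."""
    cls_id, sep_id, unk_id = vocabmap["[CLS]"], vocabmap["[SEP]"], vocabmap["[UNK]"]
    # Assumption: an "empty sentence" is just [CLS] followed by [SEP].
    empty_sentence = [cls_id, sep_id]
    n_special = 2 if add_special_tokens else 0

    tokenized: List[List[int]] = []
    for segment in kmerized_segments:
        # Assumption: expected tokens = k-mers + special tokens.
        expected_tokens = len(segment) + n_special
        if expected_tokens > token_limit:
            raise ValueError(
                f"Expected {expected_tokens} tokens, but token_limit is {token_limit}."
            )

        # Assumption: a k-mer is unknown when it is missing from vocabmap.
        n_unknown = sum(1 for kmer in segment if kmer not in vocabmap)
        if segment and n_unknown / len(segment) > max_unknown_token_proportion:
            tokenized.append(list(empty_sentence))
            continue

        ids = [vocabmap.get(kmer, unk_id) for kmer in segment]
        if add_special_tokens:
            ids = [cls_id] + ids + [sep_id]
        tokenized.append(ids)
    return tokenized


vocab = {"[CLS]": 2, "[SEP]": 3, "[UNK]": 0,
         "TCTTTG": 4, "CTTTGC": 5, "TTTGCT": 6, "TTGCTA": 7}

# All k-mers known: IDs wrapped in [CLS] ... [SEP].
known = tokenize_segments_sketch(
    [["TCTTTG", "CTTTGC", "TTTGCT", "TTGCTA"]], vocab, 10, 0.2)

# Half of the k-mers unknown (0.5 > 0.2): the sketch returns the empty sentence.
mostly_unknown = tokenize_segments_sketch([["TCTTTG", "NNNNNN"]], vocab, 10, 0.2)
```

Under these assumptions, known is [[2, 4, 5, 6, 7, 3]], matching the example above, and mostly_unknown is [[2, 3]]. Calling the sketch with token_limit=5 on the four-k-mer segment raises ValueError, because 4 k-mers plus 2 special tokens exceeds the limit.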