prokbert.sequtils.lca_tokenize_segment

prokbert.sequtils.lca_tokenize_segment(segment: str, params: Dict[str, Any]) → Tuple[List[List[int]], List[List[str]]]

Tokenizes a single segment using Local Context Aware (LCA) tokenization. The segment is first split into k-mers at the specified shift, and each resulting list of k-mers is then mapped to a vector of integer token IDs using the supplied vocabulary.

Parameters

segment (str) – The input nucleotide sequence segment to be tokenized.
params (dict) – Dictionary containing the tokenization parameters:
    - 'shift' (int): The k-mer shift parameter.
    - 'max_segment_length' (int): Maximum allowable segment length.
    - 'max_unknown_token_proportion' (float): Maximum allowable proportion of unknown tokens in a segment.
    - 'kmer' (int): Size of the k-mer.
    - 'token_limit' (int): Maximum number of tokens allowed in the tokenized output.
    - 'vocabmap' (dict[str, int]): Dictionary mapping k-mers to their respective token values.

Returns

A tuple containing:
    - list[list[int]]: List of tokenized segments (each segment as a list of integers).
    - list[list[str]]: List of k-merized segments with different shifts (each segment as a list of strings).

Return type

Tuple[List[List[int]], List[List[str]]]

Raises

ValueError – If the segment length exceeds the max_segment_length.

Examples

>>> vocabmap_example = {"[CLS]": 2, "[SEP]": 3, "[UNK]": 0, "TCTTT": 4, "CTTTG": 5, "TTTGC": 6, "TTGCT": 7}
>>> segment_example = 'TCTTTGCTAAG'
>>> params_example = {'shift': 1, 'max_segment_length': 512, 'max_unknown_token_proportion': 0.2, 'kmer': 5, 'token_limit': 10, 'vocabmap': vocabmap_example}
>>> lca_tokenize_segment(segment_example, params_example)
([[2, 4, 5, 6, 7, 3]], [['TCTTT', 'CTTTG', 'TTTGC', 'TTGCT']])
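
The two steps named in the description (splitting into k-mers, then mapping them to token IDs) can be illustrated with a minimal standalone sketch. This is not the prokbert implementation: the helper name sketch_lca_tokenize is hypothetical, and the sketch assumes that a shift of s yields s k-mer lists (one per starting offset 0..s-1, each advancing s positions at a time), that unseen k-mers fall back to "[UNK]", and that each token vector is wrapped in "[CLS]" and "[SEP]". It ignores max_segment_length, token_limit and max_unknown_token_proportion.

```python
from typing import Dict, List, Tuple


def sketch_lca_tokenize(
    segment: str, kmer: int, shift: int, vocabmap: Dict[str, int]
) -> Tuple[List[List[int]], List[List[str]]]:
    """Hypothetical illustration only; not the library's code."""
    tokenized: List[List[int]] = []
    kmerized: List[List[str]] = []
    # Assumption: one k-mer list per starting offset in [0, shift).
    for offset in range(shift):
        # Slide a window of width `kmer`, advancing `shift` bases each step.
        kmers = [
            segment[i:i + kmer]
            for i in range(offset, len(segment) - kmer + 1, shift)
        ]
        # Assumption: unseen k-mers map to [UNK]; vectors are framed by [CLS]/[SEP].
        ids = (
            [vocabmap["[CLS]"]]
            + [vocabmap.get(k, vocabmap["[UNK]"]) for k in kmers]
            + [vocabmap["[SEP]"]]
        )
        kmerized.append(kmers)
        tokenized.append(ids)
    return tokenized, kmerized


vocab = {"[CLS]": 2, "[SEP]": 3, "[UNK]": 0,
         "TCTTT": 4, "CTTTG": 5, "TTTGC": 6, "TTGCT": 7}
ids_shift1, kmers_shift1 = sketch_lca_tokenize("TCTTTGCT", kmer=5, shift=1, vocabmap=vocab)
ids_shift2, kmers_shift2 = sketch_lca_tokenize("TCTTTGCT", kmer=5, shift=2, vocabmap=vocab)
```

Under these assumptions, shift=1 produces a single list containing every overlapping 5-mer, while shift=2 produces two interleaved lists, one starting at offset 0 and one at offset 1.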