prokbert.sequtils.process_batch_tokenize_segments_with_ids

prokbert.sequtils.process_batch_tokenize_segments_with_ids(segments: List[str], segment_ids: List[Any], tokenization_params: Dict[str, Any], np_token_type: type = numpy.uint16) → Dict[Any, List[ndarray]]

Tokenizes a batch of segments and associates them with their provided IDs.

This function generates vector representations for a collection of segments. It assumes the segments have already passed quality control. The result is a dictionary whose keys are segment IDs and whose values are lists of possible vector representations of that segment, one list element per shift.

Each vector representation is converted to a NumPy array. The output is not a single 2D rectangular array; it is a dictionary mapping each segment ID to its list of tokenized representations.

Parameters

segments (List[str]) – A list of preprocessed and validated segments.
segment_ids (List[Any]) – A list of segment IDs, one per entry in segments and in the same order.
tokenization_params (Dict[str, Any]) – A dictionary containing tokenization parameters.
np_token_type (type, optional) – NumPy data type used for the tokenized segments. Defaults to np.uint16.
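As a brief illustration of what the np_token_type choice implies, the sketch below checks whether a set of token IDs fits in a given NumPy integer type. The helper fits_in_dtype and the sample ID lists are hypothetical and are not part of prokbert; only the NumPy calls are real.

```python
import numpy as np

# Largest token ID that the default dtype can store.
max_id = np.iinfo(np.uint16).max  # 65535


def fits_in_dtype(token_ids, dtype=np.uint16):
    """Return True if every token ID can be stored in `dtype` without overflow.

    Hypothetical helper for illustration only; not a prokbert function.
    """
    info = np.iinfo(dtype)
    return all(info.min <= t <= info.max for t in token_ids)


small_vocab_ids = [0, 1, 2, 4100]   # made-up IDs from a small vocabulary
large_vocab_ids = [0, 1, 70000]     # made-up IDs exceeding the uint16 range

print(fits_in_dtype(small_vocab_ids))             # True
print(fits_in_dtype(large_vocab_ids))             # False
print(fits_in_dtype(large_vocab_ids, np.uint32))  # True
```

If the vocabulary in use has more than 65,536 entries, a wider type such as np.uint32 would be the safer choice.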

Returns

A dictionary with segment IDs as keys and lists of numpy arrays representing tokenized segments as values.

Return type

Dict[Any, List[np.ndarray]]
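To make the shape of the return value concrete, here is a minimal sketch of a dictionary of per-shift arrays. It is a toy stand-in and not the prokbert implementation: the function toy_tokenize_with_ids, its kmer and shift arguments, the k-mer vocabulary, and the windowing rule are all assumptions chosen for illustration.

```python
from itertools import product
from typing import Any, Dict, List

import numpy as np


def toy_tokenize_with_ids(
    segments: List[str],
    segment_ids: List[Any],
    kmer: int = 2,
    shift: int = 2,
    np_token_type: type = np.uint16,
) -> Dict[Any, List[np.ndarray]]:
    """Toy stand-in (assumption): one token array per start offset in range(shift)."""
    # Hypothetical vocabulary: each k-mer over ACGT gets a consecutive integer ID.
    vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=kmer))}

    result: Dict[Any, List[np.ndarray]] = {}
    for seg_id, seg in zip(segment_ids, segments):
        per_shift = []
        for offset in range(shift):
            # Slide a k-mer window over the segment, stepping by `shift`.
            kmers = [seg[i:i + kmer] for i in range(offset, len(seg) - kmer + 1, shift)]
            per_shift.append(np.array([vocab[k] for k in kmers], dtype=np_token_type))
        result[seg_id] = per_shift
    return result


out = toy_tokenize_with_ids(["ACTG", "TGCATG"], [1, 2])

for seg_id, arrays in out.items():
    # Arrays differ in length across shifts and segments, so the result
    # cannot be stacked into one rectangular 2D array.
    print(seg_id, [a.tolist() for a in arrays])
```

In this toy setup, segment 1 yields arrays of length 2 and 1, while segment 2 yields arrays of length 3 and 2, which is why a dictionary of lists is a more natural container than a single matrix.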

Example:

>>> from prokbert.sequtils import process_batch_tokenize_segments_with_ids
>>> segments = ['ACTG', 'TGCA']
>>> segment_ids = [1, 2]
>>> tokenization_params = {'max_segment_length': 50, ...}
>>> tokenized_segments = process_batch_tokenize_segments_with_ids(
...     segments, segment_ids, tokenization_params
... )