prokbert.sequtils.process_batch_tokenize_segments_with_ids
- prokbert.sequtils.process_batch_tokenize_segments_with_ids(segments: List[str], segment_ids: List[Any], tokenization_params: Dict[str, Any], np_token_type: type = numpy.uint16) → Dict[Any, List[ndarray]]
Tokenizes a batch of segments and associates them with their provided IDs.
This function generates vector representations for a collection of segments and assumes the segments have already passed quality control. The result is a dictionary whose keys are segment IDs and whose values are lists of vector representations of that segment, one list element per shift.
The vector representations are converted to numpy arrays. The output is not a 2D rectangular array but a dictionary mapping each segment ID to its tokenized representations.
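To make the shape of the result concrete, here is a minimal sketch that mimics the described layout with a toy k-mer tokenizer. The helpers `toy_tokenize_with_shifts` and `toy_batch`, the `KMER` and `shift` values, and the vocabulary are all hypothetical and are not part of prokbert's API; only the dictionary-of-lists structure follows the description above.

```python
from itertools import product
from typing import Any, Dict, List

import numpy as np

# Hypothetical vocabulary: every DNA 2-mer mapped to an integer ID.
KMER = 2
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=KMER))}


def toy_tokenize_with_shifts(segment: str, shift: int, dtype=np.uint16) -> List[np.ndarray]:
    """Return one token array per shift offset (0..shift-1). Illustrative only."""
    arrays = []
    for offset in range(shift):
        # Take every `shift`-th k-mer, starting at this offset.
        kmers = [segment[i:i + KMER] for i in range(offset, len(segment) - KMER + 1, shift)]
        arrays.append(np.array([VOCAB[k] for k in kmers], dtype=dtype))
    return arrays


def toy_batch(segments: List[str], segment_ids: List[Any], shift: int = 2) -> Dict[Any, List[np.ndarray]]:
    """Map each segment ID to its list of per-shift token arrays."""
    return {sid: toy_tokenize_with_shifts(seg, shift) for seg, sid in zip(segments, segment_ids)}


result = toy_batch(["ACTGA", "TGCA"], [1, 2])
for sid, arrays in result.items():
    print(sid, [a.tolist() for a in arrays])
```

In this sketch the arrays for different IDs and shifts have different lengths, which is why a dictionary of lists is a more natural container than a single 2D array.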
- Parameters
segments (List[str]) – A list of preprocessed and validated segments.
segment_ids (List[Any]) – A list of segment IDs corresponding to each segment in segments.
tokenization_params (Dict[str, Any]) – A dictionary containing tokenization parameters.
np_token_type (type, optional) – Numpy data type for the tokenized segments. Defaults to np.uint16.
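When choosing np_token_type, the integer range of the dtype has to cover every token ID. The check below is a small sketch that uses only NumPy; the helper `fits_dtype` and the vocabulary sizes are made-up values for illustration, not prokbert settings.

```python
import numpy as np


def fits_dtype(vocab_size: int, dtype=np.uint16) -> bool:
    """True if token IDs 0..vocab_size-1 are all representable in `dtype`."""
    return 0 <= vocab_size - 1 <= np.iinfo(dtype).max


# Largest token ID that np.uint16 can store.
uint16_max = int(np.iinfo(np.uint16).max)

print(uint16_max)
print(fits_dtype(4101))              # hypothetical vocabulary size
print(fits_dtype(70000))             # too large for uint16
print(fits_dtype(70000, np.uint32))  # a wider dtype accommodates it
```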
- Returns
A dictionary with segment IDs as keys and lists of numpy arrays representing tokenized segments as values.
- Return type
Dict[Any, List[np.ndarray]]
- Example:
>>> segments = ['ACTG', 'TGCA']
>>> segment_ids = [1, 2]
>>> tokenization_params = {'max_segment_length': 50, ...}
>>> tokenized_segments = process_batch_tokenize_segments_with_ids(
...     segments, segment_ids, tokenization_params
... )