prokbert.sequtils.process_batch_tokenize_segments_with_ids
- prokbert.sequtils.process_batch_tokenize_segments_with_ids(segments, segment_ids, tokenization_params, np_token_type=<class 'numpy.uint16'>)
Tokenizes a batch of segments and associates them with their provided IDs.
This function produces tokenized vector representations for a collection of segments. It assumes that the segments have already passed quality control. The result is a dictionary whose keys are the provided segment IDs and whose values are lists of vector representations for the corresponding segment. Each list element corresponds to one shift (e.g., 0-shifted, 1-shifted, etc.).
The vector representations are converted to NumPy arrays (the np_token_type parameter defaults to numpy.uint16). Note that the output is not a 2D rectangular array but a list of arrays, since the arrays may differ in length.
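The shape of the returned structure can be illustrated with a small self-contained sketch. This is not ProkBERT's tokenizer: the helpers toy_kmer_ids and toy_batch_tokenize, the base-4 k-mer encoding, and the kmer and shift arguments are all hypothetical stand-ins chosen for illustration. The real function takes its settings from tokenization_params and may add special tokens or use a different vocabulary. Only the layout (segment ID mapped to a list of per-shift arrays) mirrors the description above.

```python
import numpy as np

# Hypothetical nucleotide-to-digit mapping, used only for this illustration.
BASE = {"A": 0, "C": 1, "G": 2, "T": 3}


def toy_kmer_ids(segment, kmer, shift, offset, np_token_type=np.uint16):
    """Encode the k-mers of one segment, starting at `offset` and stepping by `shift`.

    Each k-mer is read as a base-4 number; this is an assumption for the sketch,
    not ProkBERT's vocabulary.
    """
    ids = []
    for start in range(offset, len(segment) - kmer + 1, shift):
        value = 0
        for ch in segment[start:start + kmer]:
            value = value * 4 + BASE[ch]
        ids.append(value)
    return np.array(ids, dtype=np_token_type)


def toy_batch_tokenize(segments, segment_ids, kmer=3, shift=2):
    """Return {segment_id: [array for offset 0, array for offset 1, ...]}."""
    return {
        sid: [toy_kmer_ids(seg, kmer, shift, offset) for offset in range(shift)]
        for sid, seg in zip(segment_ids, segments)
    }


if __name__ == "__main__":
    result = toy_batch_tokenize(["ACGTACGT", "ACGTA"], ["seg_a", "seg_b"])
    for sid, arrays in result.items():
        # Each value is a list of 1D arrays, one per shift offset.
        print(sid, [len(a) for a in arrays])
    # seg_a has arrays of length 3 and 3; seg_b has arrays of length 2 and 1,
    # so the values cannot be stacked into one rectangular 2D array.
```

Keeping the per-shift arrays in a plain list avoids padding: arrays of unequal length stay as they are, and any padding or truncation can be applied later when batches are assembled.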
- Parameters
- Returns
A dictionary where keys are segment IDs and values are lists of numpy arrays representing tokenized segments.
- Return type