prokbert.sequtils.process_batch_tokenize_segments_with_ids

prokbert.sequtils.process_batch_tokenize_segments_with_ids(segments, segment_ids, tokenization_params, np_token_type=<class 'numpy.uint16'>)

Tokenizes a batch of segments and associates them with their provided IDs.

This function builds vector representations for a collection of segments and assumes the segments have already passed quality control. The result is a dictionary whose keys are the provided segment IDs and whose values are lists of vector representations of the corresponding segment, with one list element per shift (e.g., 0-shifted, 1-shifted, etc.).

The vector representations are converted to NumPy arrays. Note that the value stored for each segment isn’t a single 2D rectangular array but a list of arrays, one per shift.

Parameters
  • segments (list) – A list of preprocessed and validated segments.

  • segment_ids (list) – A list of segment IDs corresponding to each segment in the segments list.

  • tokenization_params (dict) – A dictionary containing tokenization parameters.

  • np_token_type (type, optional) – NumPy data type used for the token arrays. Defaults to np.uint16.

Returns

A dictionary where keys are segment IDs and values are lists of numpy arrays representing tokenized segments.

Return type

dict
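
The shape of the return value can be illustrated with a small stand-in. The sketch below is not the prokbert implementation: the function name toy_tokenize_with_ids, the kmer and shift arguments, and the on-the-fly vocabulary are all assumptions made for illustration, since the contents of tokenization_params are not specified here. It only shows the documented structure: a dictionary keyed by segment ID whose values are lists of NumPy arrays, one per shift.

```python
import numpy as np


def toy_tokenize_with_ids(segments, segment_ids, kmer=3, shift=2,
                          np_token_type=np.uint16):
    """Hypothetical stand-in that mimics the documented output structure."""
    vocab = {}   # toy vocabulary built on the fly: k-mer string -> integer ID
    result = {}
    for seg_id, seg in zip(segment_ids, segments):
        per_shift = []
        # One token vector per starting offset (0-shifted, 1-shifted, ...).
        for offset in range(shift):
            kmers = [seg[i:i + kmer]
                     for i in range(offset, len(seg) - kmer + 1, shift)]
            ids = [vocab.setdefault(km, len(vocab)) for km in kmers]
            per_shift.append(np.array(ids, dtype=np_token_type))
        result[seg_id] = per_shift
    return result


tokenized = toy_tokenize_with_ids(["ATGCATGC", "ATGCA"], [10, 11])

for seg_id, arrays in tokenized.items():
    # Arrays for one segment may differ in length, hence a list, not a 2D array.
    print(seg_id, [a.tolist() for a in arrays], arrays[0].dtype)
```

Because the arrays for a given segment ID can have different lengths, stacking them into one rectangular array would require padding; keeping them as a list avoids that.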