prokbert.sequtils.batch_tokenize_segments_with_ids

prokbert.sequtils.batch_tokenize_segments_with_ids(segment_data, tokenization_params, num_cores=1, batch_size=10000, np_token_type=<class 'numpy.uint16'>)

Tokenizes segments in parallel. If segment_data is a DataFrame, it is split into chunks as specified in the parameters. The number of cores is set by num_cores, which defaults to 1 in the signature above. If segment_data is a tuple, its first element is expected to be the list of segments and its second element the corresponding segment ids. The segment ids must be unique, and the segments should already be quality controlled.

Parameters
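  • segment_data – The segments to tokenize: either a DataFrame or a tuple of (segments, segment ids).

  • tokenization_params – The tokenization parameters.

  • num_cores – Number of cores to use (Default value = 1)

  • batch_size – Batch size (Default value = 10000)

  • np_token_type – NumPy data type (Default value = np.uint16)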
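
The tuple input form and the chunking described above can be illustrated with a minimal, standard-library sketch. It does not call prokbert: the helper names check_unique_ids and split_into_batches are hypothetical and are not part of the library, the sample sequences are made up, and the way the real function chunks its input is an assumption based only on the description.

```python
from typing import Iterator, List, Sequence, Tuple


def check_unique_ids(segment_ids: Sequence[int]) -> None:
    """Raise ValueError if any segment id occurs more than once."""
    if len(set(segment_ids)) != len(segment_ids):
        raise ValueError("segment_ids must be unique")


def split_into_batches(
    segments: Sequence[str],
    segment_ids: Sequence[int],
    batch_size: int = 10000,
) -> Iterator[Tuple[List[str], List[int]]]:
    """Yield (segments, ids) chunks holding at most batch_size items each."""
    if len(segments) != len(segment_ids):
        raise ValueError("segments and segment_ids must have the same length")
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(segments), batch_size):
        stop = start + batch_size
        yield list(segments[start:stop]), list(segment_ids[start:stop])


# Hypothetical input in the tuple form: (list of segments, list of ids).
segments = ["ATGCATGC", "GGCCTTAA", "TTAACCGG", "CATGCATG", "ACGTACGT"]
segment_ids = [0, 1, 2, 3, 4]

check_unique_ids(segment_ids)
batches = list(split_into_batches(segments, segment_ids, batch_size=2))
```

With five segments and batch_size=2, the sketch produces three chunks, the last of which holds a single segment. Checking id uniqueness before chunking mirrors the requirement stated in the description; whether the library performs such a check itself is not stated here.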