prokbert.sequtils.batch_tokenize_segments_with_ids

prokbert.sequtils.batch_tokenize_segments_with_ids(segment_data: Union[Tuple[List[str], List[Any]], pandas.DataFrame], tokenization_params: Dict[str, Any], num_cores: int = 1, batch_size: int = 10000, np_token_type: type = numpy.uint16) → Dict[Any, List[ndarray]]

Parallel tokenization of segments with associated IDs.

This function splits the input data into batches and uses multiprocessing to tokenize the segments in parallel. It supports both list/tuple inputs and pandas DataFrames.
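The batching and fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: `make_batches`, `toy_tokenize_batch`, and `tokenize_in_batches` are hypothetical names, and the per-nucleotide vocabulary is a stand-in for the real tokenizer.

```python
from multiprocessing import Pool
from typing import Any, Dict, List, Tuple

Batch = Tuple[List[str], List[Any]]


def make_batches(segments: List[str], segment_ids: List[Any], batch_size: int) -> List[Batch]:
    """Split parallel lists of segments and IDs into consecutive batches."""
    return [
        (segments[i:i + batch_size], segment_ids[i:i + batch_size])
        for i in range(0, len(segments), batch_size)
    ]


def toy_tokenize_batch(batch: Batch) -> Dict[Any, List[List[int]]]:
    """Hypothetical stand-in tokenizer: one integer per nucleotide."""
    vocab = {"A": 0, "C": 1, "G": 2, "T": 3}
    segments, ids = batch
    # Each ID maps to a list of token sequences, mirroring the documented return shape.
    return {sid: [[vocab[ch] for ch in seg]] for seg, sid in zip(segments, ids)}


def tokenize_in_batches(
    segments: List[str],
    segment_ids: List[Any],
    num_cores: int = 1,
    batch_size: int = 10000,
) -> Dict[Any, List[List[int]]]:
    """Tokenize batch by batch, using a process pool when num_cores > 1."""
    batches = make_batches(segments, segment_ids, batch_size)
    if num_cores > 1:
        with Pool(processes=num_cores) as pool:
            partial_results = pool.map(toy_tokenize_batch, batches)
    else:
        partial_results = [toy_tokenize_batch(b) for b in batches]

    # Merge the per-batch dictionaries into a single result.
    merged: Dict[Any, List[List[int]]] = {}
    for part in partial_results:
        merged.update(part)
    return merged
```

On platforms that start worker processes with the spawn method, a call with `num_cores > 1` should sit under an `if __name__ == "__main__":` guard so the workers can import the module safely.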

Parameters

segment_data (Union[Tuple[List[str], List[Any]], pd.DataFrame]) – Either a tuple/list containing two elements (segments, segment_ids), or a pandas DataFrame with ‘segment’ and ‘segment_id’ columns.
tokenization_params (Dict[str, Any]) – Dictionary containing tokenization parameters.
num_cores (int, optional) – Number of CPU cores to use for parallel processing. Defaults to 1.
batch_size (int, optional) – Number of segments to process in each batch. Defaults to 10,000.
np_token_type (type, optional) – Numpy data type for the tokenized segments. Defaults to np.uint16.
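When choosing np_token_type, it can help to confirm that the largest token ID fits in the dtype. The sketch below uses only standard NumPy calls; the token IDs are made-up values for illustration, not output of the ProkBERT tokenizer.

```python
import numpy as np

# Hypothetical token IDs; real values depend on the tokenizer vocabulary.
token_ids = [2, 17, 4095, 3]

# Largest value representable in the default token dtype.
max_id = np.iinfo(np.uint16).max

if max(token_ids) > max_id:
    raise ValueError("token IDs do not fit in np.uint16; choose a wider dtype")

# Two bytes per token instead of the eight used by a default int64 array.
arr = np.array(token_ids, dtype=np.uint16)
```

A vocabulary with more than 65,536 entries would need a wider type such as np.uint32.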

Returns

A dictionary where keys are segment IDs and values are lists of numpy arrays representing tokenized segments.

Return type

Dict[Any, List[np.ndarray]]
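A small sketch of walking a result in the documented shape. The dictionary below is built by hand with made-up token values; it is not actual output of the function.

```python
import numpy as np

# Hypothetical result: segment ID -> list of token arrays.
tokenized_data = {
    1: [np.array([2, 5, 9, 3], dtype=np.uint16)],
    2: [
        np.array([2, 7, 3], dtype=np.uint16),
        np.array([2, 8, 3], dtype=np.uint16),
    ],
}

# Number of tokens in each array, grouped by segment ID.
token_counts = {
    segment_id: [int(arr.size) for arr in arrays]
    for segment_id, arrays in tokenized_data.items()
}
```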

Raises

ValueError – If the input data is neither a tuple/list nor a pandas DataFrame.
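The two accepted input forms and the error case could be normalized along these lines. This is a sketch under assumptions: `unpack_segment_data` is a hypothetical helper, and the library's actual checks and error message may differ.

```python
from typing import Any, List, Tuple

import pandas as pd


def unpack_segment_data(segment_data: Any) -> Tuple[List[str], List[Any]]:
    """Return (segments, segment_ids) from either supported input form."""
    if isinstance(segment_data, pd.DataFrame):
        # DataFrame form: uses the documented 'segment' and 'segment_id' columns.
        return segment_data["segment"].tolist(), segment_data["segment_id"].tolist()
    if isinstance(segment_data, (tuple, list)):
        # Tuple/list form: two elements, segments first and IDs second.
        segments, segment_ids = segment_data
        return list(segments), list(segment_ids)
    raise ValueError(
        "segment_data must be a (segments, segment_ids) tuple/list or a pandas DataFrame"
    )
```

Both forms then yield the same pair of lists, so the rest of a pipeline does not need to know which one the caller passed.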

Example:

>>> from prokbert.sequtils import batch_tokenize_segments_with_ids
>>> segments = ['ACTG', 'TGCA']
>>> segment_ids = [1, 2]
>>> tokenization_params = {'max_segment_length': 50, ...}
>>> tokenized_data = batch_tokenize_segments_with_ids(
...     (segments, segment_ids),
...     tokenization_params,
...     num_cores=4,
...     batch_size=1000
... )