prokbert.sequtils.segment_sequences_random

prokbert.sequtils.segment_sequences_random(sequences, params)

Randomly segment the input sequences.

This function takes either a list of sequences or a DataFrame of sequences. A DataFrame is assumed to be preprocessed: its “sequence” column stores the sequences to be segmented, and its “sequence_id” column serves as a valid primary key.

The actual coverage may differ from the expected one. The function returns a list of dictionaries, one per segment, each holding the segment’s sequence, its start and end positions, and the ID of the sequence it was drawn from. Note that segment IDs are not generated in this function.

Parameters
  • sequences (pd.DataFrame or list) – Either a DataFrame with the sequences in the “sequence” column and their IDs in the “sequence_id” column, or a list of sequences.

  • params (dict) – A dictionary containing segmentation parameters, including ‘coverage’, ‘min_length’, and ‘max_length’.

Returns

A list of dictionaries, one per segment. Each dictionary holds the segment’s sequence, its start position, its end position, and the associated sequence ID. Note that segment IDs are not generated in this function.

Return type

list of dict

Notes

The actual number of segments may differ from the expected number, both because segments are sampled randomly and because some input sequences may be shorter than the segment size.
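
The effect described in this note can be illustrated with a minimal, self-contained sketch. It is not prokbert's implementation: the helper `sketch_random_segments`, the record key names (`segment`, `segment_start`, `segment_end`), and the reading of ‘coverage’ as sampled bases divided by sequence length are all assumptions made for illustration. Only the parameter names ‘coverage’, ‘min_length’, and ‘max_length’ come from the documentation above.

```python
import random


def sketch_random_segments(records, params, seed=0):
    """Illustrative random segmentation; NOT prokbert's actual code."""
    rng = random.Random(seed)
    min_len = params["min_length"]
    max_len = params["max_length"]
    mean_len = (min_len + max_len) / 2
    segments = []
    for rec in records:
        seq = rec["sequence"]
        if len(seq) < min_len:
            # Too short to yield a segment, so it contributes nothing.
            continue
        # Assumption: coverage = total sampled bases / sequence length.
        n_expected = max(1, round(params["coverage"] * len(seq) / mean_len))
        for _ in range(n_expected):
            seg_len = rng.randint(min_len, min(max_len, len(seq)))
            start = rng.randint(0, len(seq) - seg_len)
            segments.append({
                "sequence_id": rec["sequence_id"],   # hypothetical key names
                "segment_start": start,
                "segment_end": start + seg_len,
                "segment": seq[start:start + seg_len],
            })
    return segments


records = [
    {"sequence_id": 0, "sequence": "ACGT" * 50},  # 200 bases
    {"sequence_id": 1, "sequence": "ACGTAC"},     # shorter than min_length
]
params = {"coverage": 2.0, "min_length": 20, "max_length": 40}

segments = sketch_random_segments(records, params)
# Achieved coverage on the 200-base sequence; varies with the random draws.
achieved = sum(len(s["segment"]) for s in segments) / 200
print(len(segments), round(achieved, 2))
```

In this sketch the short sequence yields no segments at all, and the achieved coverage on the long sequence depends on the randomly drawn segment lengths, so both the segment count and the coverage can deviate from what the parameters alone would suggest.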