prokbert.sequtils.segment_sequences_random

prokbert.sequtils.segment_sequences_random(sequences: DataFrame, params: Dict[str, Any]) → List[Dict[str, Union[int, str]]]

Randomly segments the input sequences.

This function accepts either a list of sequences or a DataFrame containing sequences. A DataFrame is assumed to hold preprocessed sequences in “sequence” and “sequence_id” columns, where “sequence_id” is a valid primary key. The function returns a list of dictionaries, each describing one segment: its sequence, start position, end position, and associated sequence ID. A segment ID is also part of a segment's details, but it is not generated in this function.

Parameters
  • sequences (Union[pd.DataFrame, List[str]]) – A DataFrame containing sequences with “sequence” and “sequence_id” columns or a list of sequences.

  • params (Dict[str, Union[int, float, str, Dict, List, Tuple]]) – Dictionary containing segmentation parameters such as ‘coverage’, ‘min_length’, and ‘max_length’.

Returns

A list of dictionaries with each containing details of a segment.

Return type

List[Dict[str, Union[int, str]]]

Notes
  • The actual number of segments may differ from the expected number, both because segments are sampled randomly and because some sequences may be shorter than the specified segment size.

  • Segment IDs are not generated by this function.
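
The behaviour described above can be illustrated with a minimal, self-contained sketch. This is not ProkBERT's implementation: the helper name `random_segments_sketch`, the rule for turning ‘coverage’ into a segment count, and the dictionary keys (`segment_start`, `segment_end`, `segment`, `segment_id`) are assumptions made for illustration. Only the parameter names ‘coverage’, ‘min_length’, and ‘max_length’ come from the documentation. Plain `(sequence_id, sequence)` pairs stand in for the DataFrame rows.

```python
import random
from typing import Any, Dict, List, Tuple, Union


def random_segments_sketch(
    records: List[Tuple[int, str]],
    params: Dict[str, Any],
    seed: int = 0,
) -> List[Dict[str, Union[int, str]]]:
    """Hypothetical helper: sample random segments from (sequence_id, sequence) pairs."""
    rng = random.Random(seed)
    min_len = params["min_length"]
    max_len = params["max_length"]
    coverage = params["coverage"]
    mean_len = (min_len + max_len) / 2

    segments: List[Dict[str, Union[int, str]]] = []
    for sequence_id, sequence in records:
        n = len(sequence)
        if n < min_len:
            # Too short for any valid segment, so this sequence yields nothing.
            continue
        # Assumed rule: sample enough segments to cover the sequence
        # `coverage` times on average.
        expected = max(1, round(coverage * n / mean_len))
        for _ in range(expected):
            length = rng.randint(min_len, min(max_len, n))
            start = rng.randint(0, n - length)
            segments.append({
                "sequence_id": sequence_id,
                "segment_start": start,
                "segment_end": start + length,
                "segment": sequence[start:start + length],
            })
    return segments


# Usage: one 100-base sequence and one sequence shorter than min_length.
records = [(0, "ACGT" * 25), (1, "ACG")]
params = {"coverage": 2.0, "min_length": 10, "max_length": 20}
segments = random_segments_sketch(records, params)

# Segment IDs are assigned in a separate step, here simply by list position.
for segment_id, seg in enumerate(segments):
    seg["segment_id"] = segment_id
```

In this sketch the short sequence contributes no segments, which is one way the actual segment count can fall below what the coverage alone would suggest.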