prokbert.sequtils.segment_sequences_random
- prokbert.sequtils.segment_sequences_random(sequences: DataFrame, params: Dict[str, Any]) → List[Dict[str, Union[int, str]]]
Randomly segments the input sequences.
This function accepts either a list of sequences or a DataFrame containing sequences. If a DataFrame is provided, it is assumed to hold preprocessed sequences with “sequence” and “sequence_id” columns, where “sequence_id” is a valid primary key (unique per row). The function returns a list of dictionaries, each describing one segment: its sequence, start position, end position, and the ID of the sequence it was taken from. A segment ID belongs to the segment record but is not assigned by this function.
- Parameters
sequences (Union[pd.DataFrame, List[str]]) – A DataFrame containing sequences with “sequence” and “sequence_id” columns or a list of sequences.
params (Dict[str, Union[int, float, str, Dict, List, Tuple]]) – Dictionary containing segmentation parameters such as ‘coverage’, ‘min_length’, and ‘max_length’.
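As an illustration of the expected inputs, the sketch below builds records with the two required fields and a parameter dictionary with the three keys named above, then checks that “sequence_id” is unique. It uses plain Python records instead of a DataFrame to stay dependency-free. The helper `has_unique_ids` and all concrete values are hypothetical and not part of prokbert; the real function may require further parameter keys.

```python
from typing import Any, Dict, List

# Hypothetical input records; in practice these would be the rows of a
# DataFrame with "sequence" and "sequence_id" columns.
records: List[Dict[str, Any]] = [
    {"sequence_id": 0, "sequence": "ATGCGTACGTTAGC"},
    {"sequence_id": 1, "sequence": "GGCATTACGGA"},
]

# Keys are taken from the parameter description; the values are made up.
params: Dict[str, Any] = {"coverage": 2.0, "min_length": 4, "max_length": 8}


def has_unique_ids(rows: List[Dict[str, Any]]) -> bool:
    """Return True if every row has a distinct sequence_id (a valid primary key)."""
    ids = [row["sequence_id"] for row in rows]
    return len(ids) == len(set(ids))
```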
- Returns
A list of dictionaries with each containing details of a segment.
- Return type
List[Dict[str, Union[int, str]]]
- Notes
The actual number of segments may differ from the expected number due to random sampling and sequences being shorter than the specified segment size.
Segment IDs are not generated by this function.
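The two notes above can be illustrated with a minimal sketch of coverage-based random segmentation. This is not prokbert's implementation: the functions `random_segments_sketch` and `add_segment_ids`, the way the expected segment count is derived from the coverage, and the dictionary keys (`segment`, `segment_start`, `segment_end`, `segment_id`) are all assumptions made for illustration. It shows why sequences shorter than the minimum length contribute no segments and how segment IDs could be assigned in a separate step.

```python
import random
from typing import Any, Dict, List, Union

Segment = Dict[str, Union[int, str]]


def random_segments_sketch(
    records: List[Dict[str, Any]],
    coverage: float,
    min_length: int,
    max_length: int,
    seed: int = 0,
) -> List[Segment]:
    """Sample random segments from each record (illustrative, not prokbert's code)."""
    rng = random.Random(seed)
    segments: List[Segment] = []
    for rec in records:
        seq = rec["sequence"]
        if len(seq) < min_length:
            # Too short to yield any segment, so the total falls below expectation.
            continue
        # Assumed rule: enough segments of maximal length to reach the coverage.
        n_segments = int(coverage * len(seq) / max_length)
        for _ in range(n_segments):
            length = rng.randint(min_length, min(max_length, len(seq)))
            start = rng.randint(0, len(seq) - length)
            segments.append(
                {
                    "sequence_id": rec["sequence_id"],
                    "segment_start": start,
                    "segment_end": start + length,  # exclusive end position
                    "segment": seq[start:start + length],
                }
            )
    return segments


def add_segment_ids(segments: List[Segment]) -> List[Segment]:
    """Return copies of the segments with a consecutive segment_id added."""
    return [dict(seg, segment_id=i) for i, seg in enumerate(segments)]
```

Under these assumptions, calling `add_segment_ids(random_segments_sketch(records, 2.0, 4, 8))` yields segment records that carry both the source “sequence_id” and a newly assigned `segment_id`.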