prokbert.sequtils.segment_sequences
- prokbert.sequtils.segment_sequences(sequences, params, AsDataFrame=False)
Segment sequences based on the provided parameters.
The input sequences are assumed to be quality-controlled and preprocessed, i.e., each is a valid nucleotide sequence. If the sequences are provided as a DataFrame, it must contain “sequence_id” and “sequence” columns, and “sequence_id” must be a valid primary key. If the output is requested as a DataFrame, the IDs are included as well.
- Parameters
sequences (list or pd.DataFrame) – A list of sequences or a DataFrame containing sequences. If a DataFrame, it must have “sequence_id” and “sequence” attributes.
params (dict) – Dictionary containing the segmentation parameters. The ‘type’ key in the dictionary can be ‘contiguous’ or ‘random’.
AsDataFrame (bool, optional) – If True, the output will be a DataFrame. If False, it will be a list. Defaults to False.
- Returns
A list of segmented sequences, or a DataFrame with the segmented sequences and their corresponding information, depending on the AsDataFrame parameter.
- Return type
list or pd.DataFrame
- Raises
ValueError – If the provided sequences DataFrame does not have the required attributes.
ValueError – If the “sequence_id” column is not a valid primary key.
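The two checks behind these errors can be sketched on plain Python records, so that the snippet runs without pandas. This is an illustration only: `validate_records` is a hypothetical helper, not part of prokbert, and it assumes that "valid primary key" means every “sequence_id” is present, non-null, and unique.

```python
def validate_records(records):
    """Mimic the documented input checks on a list of dicts.

    Hypothetical helper for illustration; each dict stands in for one
    DataFrame row. Raises ValueError on the two documented conditions.
    """
    required = {"sequence_id", "sequence"}
    for row in records:
        missing = required - row.keys()
        if missing:
            # Corresponds to "does not have the required attributes".
            raise ValueError(f"missing required attributes: {sorted(missing)}")

    ids = [row["sequence_id"] for row in records]
    # Assumption: a valid primary key has no nulls and no duplicates.
    if any(i is None for i in ids) or len(ids) != len(set(ids)):
        raise ValueError('"sequence_id" is not a valid primary key')


# Passes silently: both attributes are present and the IDs are unique.
validate_records([
    {"sequence_id": 0, "sequence": "ACGT"},
    {"sequence_id": 1, "sequence": "GGCC"},
])
```

On an actual DataFrame, the equivalent checks would likely compare the required names against `df.columns` and test `df["sequence_id"].is_unique`.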
If the segmentation type is ‘random’, the functionality is yet to be implemented.
- Examples
TODO: Add examples after finalizing the function’s behavior and output.
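Until official examples are available, the general idea of contiguous segmentation can be sketched independently of prokbert. The helper `contiguous_segments` and its `max_length` argument are hypothetical names chosen for this illustration. The keys that `params` accepts beyond ‘type’ are not documented here, so this is not the library's implementation and may differ from its actual behavior.

```python
def contiguous_segments(sequence, max_length):
    """Split `sequence` into consecutive, non-overlapping chunks.

    Hypothetical helper for illustration only. The final chunk may be
    shorter than `max_length` when the length is not an exact multiple.
    """
    if max_length <= 0:
        raise ValueError("max_length must be a positive integer")
    return [
        sequence[start:start + max_length]
        for start in range(0, len(sequence), max_length)
    ]


sequences = ["ACGTACGTAC", "GGCCTT"]

# Flatten the per-sequence chunks into one list, analogous to the
# list output described for AsDataFrame=False.
segments = [seg for seq in sequences for seg in contiguous_segments(seq, 4)]
print(segments)  # ['ACGT', 'ACGT', 'AC', 'GGCC', 'TT']
```

A DataFrame-style result would additionally carry the originating “sequence_id” alongside each segment.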