prokbert.sequtils.segment_sequences

prokbert.sequtils.segment_sequences(sequences: DataFrame, params: Dict[str, Union[int, float, str]], AsDataFrame: bool = False) → Union[List[str], DataFrame]

Segments sequences based on the provided parameters.

This function assumes that every input sequence has already been quality controlled and preprocessed, i.e., that it is a valid nucleotide sequence. If the sequences are provided as a DataFrame, it must contain a “sequence_id” column and a “sequence” column, and “sequence_id” must be a valid primary key (unique for every row). If the output is requested as a DataFrame, the sequence IDs are included in it as well.

Parameters

sequences (Union[List[str], pd.DataFrame]) – A list of sequences or a DataFrame containing sequences. If a DataFrame, it must have “sequence_id” and “sequence” attributes.
params (Dict[str, Union[int, float, str, Dict[str, int], List[int], Tuple[int, int]]]) – Dictionary containing the segmentation parameters:
    - ‘type’ (str): The type of segmentation (‘contiguous’ or ‘random’).
    - ‘min_length’ (int): Minimum length of a segment.
    - ‘max_length’ (int): Maximum length of a segment.
    - ‘coverage’ (float): Coverage percentage for random segmentation.
AsDataFrame (bool) – If True, the output will be a DataFrame. If False, it will be a list. Defaults to False.
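
To make the roles of ‘min_length’ and ‘max_length’ concrete, the following is a minimal pure-Python sketch of one plausible reading of ‘contiguous’ segmentation. It is not prokbert's implementation: the helper name contiguous_segments is hypothetical, and the assumptions that windows are non-overlapping, at most ‘max_length’ long, and dropped when shorter than ‘min_length’ are illustrative only. The ‘coverage’ key is included in the dictionary for completeness but is not used by the sketch.

```python
from typing import List


def contiguous_segments(sequence: str, min_length: int, max_length: int) -> List[str]:
    """Cut a sequence into consecutive, non-overlapping windows.

    Assumption (not confirmed by the prokbert docs): each window is at most
    max_length characters long, and a window shorter than min_length is dropped.
    """
    segments = []
    for start in range(0, len(sequence), max_length):
        segment = sequence[start:start + max_length]
        if len(segment) >= min_length:
            segments.append(segment)
    return segments


# Parameter dictionary in the shape described above; the values are illustrative.
params = {"type": "contiguous", "min_length": 4, "max_length": 6, "coverage": 1.0}

segments = contiguous_segments("AATCAATTTTATTT", params["min_length"], params["max_length"])
print(segments)  # ['AATCAA', 'TTTTAT']; the trailing 'TT' is shorter than min_length
```

Under these assumptions, setting ‘max_length’ larger than every input sequence returns each sequence as a single segment.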

Returns

A list of segmented sequences if AsDataFrame is False, or a DataFrame containing the segmented sequences and their corresponding information (including the sequence IDs) if AsDataFrame is True.

Return type

Union[List[str], pd.DataFrame]

Raises

ValueError – If the provided sequences DataFrame does not have the required attributes.
ValueError – If the “sequence_id” column is not a valid primary key.
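
The two conditions above can be checked before calling the function. The sketch below is a hypothetical helper (check_sequence_table is not part of prokbert) that mirrors the documented checks on plain Python iterables. Assuming a pandas DataFrame named df, it could be called as check_sequence_table(df.columns, df["sequence_id"]). The error messages are illustrative and may differ from those prokbert raises.

```python
from typing import Hashable, Iterable

REQUIRED_COLUMNS = ("sequence_id", "sequence")


def check_sequence_table(columns: Iterable[str], sequence_ids: Iterable[Hashable]) -> None:
    """Raise ValueError if the documented input requirements are not met."""
    present = set(columns)
    missing = [name for name in REQUIRED_COLUMNS if name not in present]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")

    ids = list(sequence_ids)
    # A valid primary key has no repeated values.
    if len(set(ids)) != len(ids):
        raise ValueError("'sequence_id' is not a valid primary key (duplicate IDs found)")


# Passes silently: both columns are present and the IDs are unique.
check_sequence_table(["sequence_id", "sequence"], ["seq1", "seq2"])

# A duplicated ID is rejected.
try:
    check_sequence_table(["sequence_id", "sequence"], ["seq1", "seq1"])
except ValueError as err:
    print(err)
```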

Examples:

>>> segment_sequences(['AATCAATTTTATTT', 'AGCCGATTCAATTGCATTATTT'], {'type': 'contiguous', 'min_length': 1, 'max_length': 1000, 'coverage': 1.0})