prokbert.sequtils.segment_sequence_contiguous

prokbert.sequtils.segment_sequence_contiguous(sequence: str, params: Dict[str, Any], sequence_id: Optional[Any] = nan) → List[Dict[str, Any]]

Splits a sequence end to end into contiguous, non-overlapping segments.

Segments shorter than the minimum length given in params are discarded. The function returns a list of the remaining segments, each with its position in the original sequence.

Parameters
  • sequence (str) – The input nucleotide sequence to be segmented.

  • params (Dict[str, Any]) – Dictionary containing the segmentation parameters. Must include ‘min_length’ and ‘max_length’ keys specifying the minimum and maximum lengths of the segments, respectively. Can contain other parameters.

  • sequence_id (Optional[Any]) – Optional identifier for the sequence. Defaults to NaN.

Returns

A list of dictionaries, one per segment. Each dictionary contains the segment’s sequence (‘segment’), its start position (‘segment_start’), its end position (‘segment_end’), and the sequence ID (‘sequence_id’).

Return type

List[Dict[str, Any]]
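
The start and end positions can be used to slice the segment back out of the input. The short check below uses a record in the documented shape, with values copied from the example on this page; it assumes that positions are 0-based and that the end position is exclusive, which is what that example suggests (start 0, end 9 for a 9-nucleotide sequence).

```python
sequence = "ATCGATCGA"

# One record in the documented shape (values taken from the example on this page).
segments = [
    {
        "segment": "ATCGATCGA",
        "segment_start": 0,
        "segment_end": 9,
        "sequence_id": float("nan"),
    },
]

# Assumption: 0-based start, exclusive end, so a plain slice recovers the segment.
recovered = [sequence[s["segment_start"]:s["segment_end"]] for s in segments]

for seg, piece in zip(segments, recovered):
    assert piece == seg["segment"]
    assert seg["segment_end"] - seg["segment_start"] == len(seg["segment"])
```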

Example:
>>> params = {'min_length': 0, 'max_length': 100}
>>> segment_sequence_contiguous('ATCGATCGA', params)
[{'segment': 'ATCGATCGA', 'segment_start': 0, 'segment_end': 9, 'sequence_id': nan}]
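
The behavior described above can be approximated with the following minimal sketch. It is not prokbert's implementation: the function name sketch_segment_contiguous is hypothetical, and the sketch assumes that the sequence is cut into windows of max_length nucleotides and that a window is kept when its length is at least min_length. It is included only to show how a sequence longer than max_length would be split and how a short trailing segment would be dropped.

```python
import math
from typing import Any, Dict, List


def sketch_segment_contiguous(
    sequence: str, params: Dict[str, Any], sequence_id: Any = math.nan
) -> List[Dict[str, Any]]:
    """Hypothetical re-implementation for illustration; not prokbert's code."""
    min_length = params["min_length"]
    max_length = params["max_length"]
    segments = []
    # Assumption: windows advance by max_length, so they never overlap.
    for start in range(0, len(sequence), max_length):
        end = min(start + max_length, len(sequence))
        # Discard windows shorter than min_length (only the last one can be short).
        if end - start >= min_length:
            segments.append(
                {
                    "segment": sequence[start:end],
                    "segment_start": start,
                    "segment_end": end,
                    "sequence_id": sequence_id,
                }
            )
    return segments


segments = sketch_segment_contiguous(
    "ATCGATCGA", {"min_length": 2, "max_length": 4}, sequence_id="seq1"
)
for seg in segments:
    print(seg["segment_start"], seg["segment_end"], seg["segment"])
# 0 4 ATCG
# 4 8 ATCG
```

Under these assumptions the final window ('A', positions 8 to 9) is shorter than min_length and is dropped, so only two segments are returned.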