prokbert.sequtils.load_contigs

prokbert.sequtils.load_contigs(fasta_files_list: Union[List[str], str], adding_reverse_complement: bool = True, IsAddHeader: bool = False, AsDataFrame: bool = False, to_uppercase: bool = False, is_add_sequence_id: bool = False) → Union[List[Union[str, List[str]]], DataFrame]

Loads contigs from one or more FASTA files.

Parameters

fasta_files_list (Union[List[str], str]) – List of paths to FASTA files or a single file path. Compressed (gz) FASTA files are accepted.
adding_reverse_complement (bool) – If True, adds the reverse complement of each sequence. Defaults to True.
IsAddHeader (bool) – If True, includes the FASTA ID and description in the output. Defaults to False.
AsDataFrame (bool) – If True, returns the sequences as a pandas DataFrame. Defaults to False.
to_uppercase (bool) – If True, converts sequences to uppercase. Defaults to False.
is_add_sequence_id (bool) – If True, adds a unique integer sequence ID to each sequence. Defaults to False.
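To make the adding_reverse_complement and to_uppercase options concrete, the snippet below sketches what a reverse complement is in plain Python. It is an illustration only, not prokbert's code: the helper name reverse_complement and the choice to leave non-ACGT characters (such as N) unchanged are assumptions made for this example.

```python
# Illustrative sketch only; this is NOT prokbert's implementation.
# Assumption: characters outside ACGT/acgt (e.g. N) are left as they are.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string, preserving letter case."""
    # Complement each base, then reverse the whole string.
    return seq.translate(_COMPLEMENT)[::-1]


forward = "ATGCcg"
print(reverse_complement(forward))          # cgGCAT
print(reverse_complement(forward.upper()))  # CGGCAT (what uppercasing first would give)
```

With adding_reverse_complement=True, the output contains both strands of every contig, so expect roughly twice as many sequences as there are records in the input files.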

Returns

The loaded sequences. If IsAddHeader is False, each sequence is returned as a string. If both IsAddHeader and is_add_sequence_id are True, each sequence is returned as a list [sequence_id, fasta_id, description, source_file, sequence, orientation]. If AsDataFrame is True, the sequences are returned as a pandas DataFrame instead of a list.

Return type

Union[List[Union[str, List[str]]], pd.DataFrame]
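The row layout described under Returns can be pictured with a small stand-in loader written against the standard library. This is a hedged sketch, not the library's code: the names read_fasta and sketch_rows, the orientation labels "forward" and "reverse", the decision to give each reverse complement its own integer ID, and the use of the full header line as the description are all assumptions made for illustration.

```python
import gzip
import os
import tempfile
from typing import Iterator, List, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_fasta(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (fasta_id, description, sequence) from a plain or gzipped FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header.split()[0] if header.split() else "", header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    # Emit the final record, if the file contained any.
    if header is not None:
        yield header.split()[0] if header.split() else "", header, "".join(chunks)


def sketch_rows(paths: List[str], add_reverse_complement: bool = True) -> List[list]:
    """Build rows shaped like [sequence_id, fasta_id, description, source_file, sequence, orientation]."""
    rows, seq_id = [], 0
    for path in paths:
        for fasta_id, description, seq in read_fasta(path):
            rows.append([seq_id, fasta_id, description, path, seq, "forward"])  # label is an assumption
            seq_id += 1
            if add_reverse_complement:
                rc = seq.translate(_COMPLEMENT)[::-1]
                rows.append([seq_id, fasta_id, description, path, rc, "reverse"])  # label is an assumption
                seq_id += 1
    return rows


# Demo on a throwaway gzipped FASTA file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fasta.gz")
    with gzip.open(demo_path, "wt") as fh:
        fh.write(">contig1 example contig\nATGC\nGG\n>contig2\nTTAA\n")
    rows = sketch_rows([demo_path])

for row in rows:
    print(row[0], row[1], row[4], row[5])
```

Rows of this shape map directly onto DataFrame columns, which is presumably why the list form and the AsDataFrame form are offered side by side.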

Example:

>>> from prokbert.sequtils import load_contigs
>>> fasta_files = ['path/to/file1.fasta', 'path/to/file2.fasta.gz']
>>> load_contigs(fasta_files, adding_reverse_complement=False, IsAddHeader=True, AsDataFrame=True, to_uppercase=True, is_add_sequence_id=True)
# Returns a DataFrame with the sequences from the specified FASTA files, without reverse complements, all in uppercase, with unique sequence IDs.