Configuration Utils

BaseConfig

`prokbert.config_utils.BaseConfig`()	Base class for managing and validating configurations.
`prokbert.config_utils.BaseConfig.cast_to_expected_type`(...)	Cast the given value to the expected type.
`prokbert.config_utils.BaseConfig.get_parameter`(...)	Retrieve the default value of a specified parameter.
`prokbert.config_utils.BaseConfig.validate_type`(...)	Validate the type of a given value against the expected type.
`prokbert.config_utils.BaseConfig.validate_value`(...)	Validate the value of a parameter against its constraints.
`prokbert.config_utils.BaseConfig.validate`(...)	Validate both the type and value of a parameter.
`prokbert.config_utils.BaseConfig.describe`(...)	Retrieve the description of a parameter.

class prokbert.config_utils.BaseConfig

Base class for managing and validating configurations.

cast_to_expected_type(parameter_class: str, parameter_name: str, value: Any) → Any

Cast the given value to the expected type.

Parameters

parameter_class (str) – The class/category of the parameter.
parameter_name (str) – The name of the parameter.
value (Any) – The value to be cast.

Returns

Value cast to the expected type.

Return type

Any

Raises

ValueError – If casting fails.

static create_parser(config: dict) → ArgumentParser

Create and configure an argparse parser based on the given configuration.

This method sets up a command-line argument parser with arguments defined in the configuration. Each top-level key in the configuration represents a group of related arguments.

Parameters: config (dict) – A dictionary where each key is a group name and each value is a dict of parameters for that group. Each parameter’s information should include its type, default value, and help description.
Returns: Configured argparse.ArgumentParser instance with arguments added as specified in the configuration.
Return type: argparse.ArgumentParser
Raises: ValueError – If an unknown or unsupported type is specified for a parameter.
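
The group-per-key layout described above can be illustrated with a minimal sketch. This is not the library's implementation: `build_parser`, `str_to_bool`, and `TYPE_MAP` are hypothetical names, and the set of supported type strings is an assumption based on the type names that appear in the YAML files.

```python
import argparse

# Hypothetical mapping from YAML type strings to Python callables (an assumption).
TYPE_MAP = {"integer": int, "int": int, "float": float, "string": str, "str": str}


def str_to_bool(text):
    """Convert a command-line string such as 'true' or 'no' to a bool."""
    lowered = text.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"Expected a boolean, got {text!r}")


def build_parser(config):
    """Create one argparse argument group per top-level key of `config`."""
    parser = argparse.ArgumentParser(description="Sketch of a config-driven parser")
    for group_name, params in config.items():
        group = parser.add_argument_group(group_name)
        for name, info in params.items():
            type_name = info.get("type")
            if type_name in ("bool", "boolean"):
                caster = str_to_bool
            elif type_name in TYPE_MAP:
                caster = TYPE_MAP[type_name]
            else:
                # Mirrors the documented behaviour: unknown types raise ValueError.
                raise ValueError(f"Unsupported type {type_name!r} for parameter {name!r}")
            group.add_argument(
                f"--{name}",
                type=caster,
                default=info.get("default"),
                help=info.get("description", ""),
            )
    return parser


example_config = {
    "tokenization": {
        "kmer": {"type": "integer", "default": 6, "description": "k-mer size"},
        "add_special_token": {"type": "bool", "default": True, "description": "add special tokens"},
    }
}
args = build_parser(example_config).parse_args(["--kmer", "4"])
```

Here `args.kmer` holds the value given on the command line, while `args.add_special_token` falls back to the default from the configuration.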

describe(parameter_class: str, parameter_name: str) → str

Retrieve the description of a parameter.

Parameters

parameter_class (str) – The class/category of the parameter.
parameter_name (str) – The name of the parameter.

Returns

Description of the parameter.

Return type

str

get_parameter(parameter_class: str, parameter_name: str) → Any

Retrieve the default value of a specified parameter.

Parameters

parameter_class (str) – The class/category of the parameter (e.g., ‘segmentation’).
parameter_name (str) – The name of the parameter.

Returns

Default value of the parameter, cast to the expected type.

Return type

Any

static rename_non_unique_parameters(config: dict) → tuple[dict, dict, dict]

Rename parameters in the configuration to ensure uniqueness across different groups.

This method identifies parameters with the same name across different groups and renames them by prefixing the group name. This is to prevent conflicts when parameters are used in a context where the group name is not specified.

Parameters

config (dict) – A dictionary where each key is a group name and each value is a dict of parameters for that group.

Returns

A tuple containing:

renamed_config: A dictionary with the same structure as the input, but with non-unique parameter names renamed. The structure is {group_name: {param_name: param_info}}.
cmd_argument2group_param: A dictionary mapping the new parameter names to their original group and parameter name. The structure is {new_param_name: [group_name, original_param_name]}.
group2param2cmdarg: A dictionary mapping each group to a dict that maps the original parameter names to the new parameter names. The structure is {group_name: {original_param_name: new_param_name}}.

Return type

tuple[dict, dict, dict]
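
The renaming and the three returned structures can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name `rename_non_unique` is hypothetical, and the `{group}_{param}` prefix format is an assumption, since the text only says that the group name is used as a prefix.

```python
from collections import Counter


def rename_non_unique(config):
    """Prefix parameter names that occur in more than one group."""
    # Count how many groups define each parameter name.
    counts = Counter(name for params in config.values() for name in params)

    renamed_config = {}
    cmd_argument2group_param = {}
    group2param2cmdarg = {}
    for group, params in config.items():
        renamed_config[group] = {}
        group2param2cmdarg[group] = {}
        for name, info in params.items():
            # Assumed naming scheme: "<group>_<param>" for clashing names only.
            new_name = f"{group}_{name}" if counts[name] > 1 else name
            renamed_config[group][new_name] = info
            cmd_argument2group_param[new_name] = [group, name]
            group2param2cmdarg[group][name] = new_name
    return renamed_config, cmd_argument2group_param, group2param2cmdarg


example = {
    "segmentation": {"type": {"default": "contiguous"}},
    "tokenization": {"type": {"default": "lca"}, "kmer": {"default": 6}},
}
renamed, arg2param, param2arg = rename_non_unique(example)
```

In this example `type` occurs in both groups and is renamed in each, while `kmer` is unique and keeps its name.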

validate(parameter_class: str, parameter_name: str, value: Any)

Validate both the type and value of a parameter.

Parameters

parameter_class (str) – The class/category of the parameter.
parameter_name (str) – The name of the parameter.
value (Any) – The value to be validated.

Raises

TypeError – If the value is not of the expected type.
ValueError – If the value does not meet the parameter’s constraints.

validate_type(parameter_class: str, parameter_name: str, value: Any) → bool

Validate the type of a given value against the expected type.

Parameters

parameter_class (str) – The class/category of the parameter.
parameter_name (str) – The name of the parameter.
value (Any) – The value to be validated.

Returns

True if the value is of the expected type, otherwise False.

Return type

bool

validate_value(parameter_class: str, parameter_name: str, value: Any) → bool

Validate the value of a parameter against its constraints.

Parameters

parameter_class (str) – The class/category of the parameter.
parameter_name (str) – The name of the parameter.
value (Any) – The value to be validated.

Returns

True if the value meets the constraints, otherwise False.

Return type

bool
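
As a rough illustration of constraint checking, the sketch below tests a value against a `constraints` mapping shaped like the ones in the YAML files (`min`, `max`, `options`). The function name `check_value` is hypothetical, and treating `min` and `max` as inclusive bounds is an assumption; the actual method may behave differently.

```python
def check_value(value, constraints):
    """Return True if `value` satisfies the optional 'options', 'min' and 'max' keys."""
    if "options" in constraints and value not in constraints["options"]:
        return False
    # Bounds are treated as inclusive here (an assumption).
    if "min" in constraints and value < constraints["min"]:
        return False
    if "max" in constraints and value > constraints["max"]:
        return False
    return True


# Constraints copied from the mlm_probability and kmer entries in the YAML files.
probability_constraints = {"min": 0.0, "max": 1.0}
kmer_constraints = {"options": [1, 2, 3, 4, 5, 6, 7, 8, 9]}

ok_probability = check_value(0.05, probability_constraints)
bad_probability = check_value(1.5, probability_constraints)
bad_kmer = check_value(12, kmer_constraints)
```

A parameter without a `constraints` entry can be checked with an empty mapping, in which case every value passes.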

SeqConfig

`prokbert.config_utils.SeqConfig`()	Class to manage and validate sequence processing configurations.
`prokbert.config_utils.SeqConfig._get_default_sequence_processing_config_file`()	Retrieve the default sequence processing configuration file.
`prokbert.config_utils.SeqConfig.get_and_set_segmentation_parameters`([...])	Retrieve and validate the provided parameters for segmentation.
`prokbert.config_utils.SeqConfig.get_and_set_tokenization_parameters`([...])
`prokbert.config_utils.SeqConfig.get_and_set_computational_parameters`([...])	Reads and validates the computational parameters.
`prokbert.config_utils.SeqConfig.get_maximum_segment_length_from_token_count_from_params`()	Calculates the maximum segment length from the token count.
`prokbert.config_utils.SeqConfig.get_maximum_segment_length_from_token_count`(...)	Calculates how long a sequence can be covered.
`prokbert.config_utils.SeqConfig.get_maximum_token_count_from_max_length`(...)	Calculates the maximum token count from the maximum segment length.

class prokbert.config_utils.SeqConfig

Bases: BaseConfig

Class to manage and validate sequence processing configurations.

get_and_set_computational_parameters(parameters: dict = {}) → dict: Reads and validates the computational parameters.

get_and_set_segmentation_parameters(parameters: dict = {}) → dict

Retrieve and validate the provided parameters for segmentation.

Parameters: parameters (dict) – A dictionary of parameters to be validated.
Returns: A dictionary of validated segmentation parameters.
Return type: dict
Raises: ValueError – If an invalid segmentation parameter is provided.

get_cmd_arg_parser() → tuple[argparse.ArgumentParser, dict, dict]

Create and return a command-line argument parser for ProkBERT configurations, along with mappings between command-line arguments and configuration parameters.

This method combines sequence configuration parameters with training configuration parameters and sets up a command-line argument parser using these combined settings. It ensures that parameter names are unique across different groups by renaming any non-unique parameters.

Returns

A tuple containing:

Configured argparse.ArgumentParser instance for handling ProkBERT configurations.
A dictionary mapping new command-line arguments to their original group and parameter name.
A dictionary mapping each group to a dict that maps the original parameter names to the new command-line argument names.

Return type

tuple[argparse.ArgumentParser, dict, dict]

Note: The method assumes that the configuration parameters for training and sequence configuration are available within the class.

static get_maximum_segment_length_from_token_count(max_token_counts, shift, kmer): Calculates how long a sequence can be covered

get_maximum_segment_length_from_token_count_from_params(): Calculates the maximum segment length from the token count

static get_maximum_token_count_from_max_length(max_segment_length, shift, kmer): Calculates the maximum token count from the maximum segment length

get_maximum_token_count_from_max_length_from_params(): Calculates the maximum token count from the maximum segment length
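
For orientation, the general relationship between segment length and the number of k-mer tokens under a sliding window can be sketched as below. This is a minimal sketch, not the library's implementation: the function names `kmer_count` and `covered_length` are hypothetical, special tokens such as CLS and SEP are not counted, and ProkBERT's own methods may account for them or round differently.

```python
def kmer_count(segment_length, shift, kmer):
    """Number of k-mers of size `kmer` taken every `shift` characters."""
    if shift < 1:
        raise ValueError("shift must be at least 1")
    if segment_length < kmer:
        return 0
    return (segment_length - kmer) // shift + 1


def covered_length(token_count, shift, kmer):
    """Sequence length spanned by `token_count` k-mers (special tokens excluded)."""
    if shift < 1:
        raise ValueError("shift must be at least 1")
    if token_count <= 0:
        return 0
    return (token_count - 1) * shift + kmer


# With the defaults kmer=6 and shift=1, a 512-character segment yields 507 k-mers,
# and 507 k-mers span 512 characters again.
tokens = kmer_count(512, shift=1, kmer=6)
length = covered_length(tokens, shift=1, kmer=6)
```

With `shift=1` the two functions are exact inverses; with a larger shift, `covered_length(kmer_count(n, ...), ...)` can be smaller than `n` because trailing characters that do not fill a whole window are dropped.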

ProkBERTConfig

`prokbert.config_utils.ProkBERTConfig`()	Class to manage and validate pretraining configurations.
`prokbert.config_utils.ProkBERTConfig._get_default_pretrain_config_file`()	Retrieve the default pretraining configuration file.
`prokbert.config_utils.ProkBERTConfig.get_set_parameters`(...)	Retrieve and validate the provided parameters for a given parameter class.
`prokbert.config_utils.ProkBERTConfig.get_and_set_model_parameters`([...])	Setting the model parameters
`prokbert.config_utils.ProkBERTConfig.get_and_set_dataset_parameters`([...])	Setting the dataset parameters
`prokbert.config_utils.ProkBERTConfig.get_and_set_pretraining_parameters`([...])	Setting the pretraining parameters
`prokbert.config_utils.ProkBERTConfig.get_and_set_datacollator_parameters`([...])	Setting the data collator parameters
`prokbert.config_utils.ProkBERTConfig.get_and_set_segmentation_parameters`([...])
`prokbert.config_utils.ProkBERTConfig.get_and_set_tokenization_parameters`([...])
`prokbert.config_utils.ProkBERTConfig.get_and_set_computation_params`([...])

class prokbert.config_utils.ProkBERTConfig

Bases: BaseConfig

Class to manage and validate pretraining configurations.

get_and_set_datacollator_parameters(parameters: dict = {}) → dict: Setting the data collator parameters

get_and_set_dataset_parameters(parameters: dict = {}) → dict: Setting the dataset parameters

get_and_set_finetuning_parameters(parameters: dict = {}) → dict: Setting the finetuning parameters

get_and_set_model_parameters(parameters: dict = {}) → dict: Setting the model parameters

get_and_set_pretraining_parameters(parameters: dict = {}) → dict: Setting the pretraining parameters

get_cmd_arg_parser(keyset=[]) → tuple[argparse.ArgumentParser, dict, dict]

Create and return a command-line argument parser for ProkBERT configurations, along with mappings between command-line arguments and configuration parameters.

This method combines sequence configuration parameters with training configuration parameters and sets up a command-line argument parser using these combined settings. It ensures that parameter names are unique across different groups by renaming any non-unique parameters.

Returns

A tuple containing:

Configured argparse.ArgumentParser instance for handling ProkBERT configurations.
A dictionary mapping new command-line arguments to their original group and parameter name.
A dictionary mapping each group to a dict that maps the original parameter names to the new command-line argument names.

Return type

tuple[argparse.ArgumentParser, dict, dict]

Note: The method assumes that the configuration parameters for training and sequence configuration are available within the class.

get_set_parameters(parameter_class: str, parameters: dict = {}) → dict

Retrieve and validate the provided parameters for a given parameter class.

Parameters

parameter_class (str) – The class/category of the parameter (e.g., ‘data_collator’).
parameters (dict) – A dictionary of parameters to be validated.

Returns

A dictionary of validated parameters.

Return type

dict

Raises

ValueError – If an invalid parameter is provided.
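
The retrieve-and-validate pattern can be sketched as a merge of user-supplied values over schema defaults. The helper name `merge_with_defaults` is hypothetical, and rejecting unknown keys is only one possible reading of "invalid parameter"; the real method presumably also validates types and constraints.

```python
def merge_with_defaults(schema, overrides):
    """Return schema defaults updated with `overrides`, rejecting unknown names."""
    unknown = set(overrides) - set(schema)
    if unknown:
        raise ValueError(f"Unknown parameter(s): {sorted(unknown)}")
    resolved = {name: info.get("default") for name, info in schema.items()}
    resolved.update(overrides)
    return resolved


# Schema excerpt shaped like the data_collator group in pretraining.yaml.
data_collator_schema = {
    "mask_to_left": {"default": 3, "type": "integer"},
    "mask_to_right": {"default": 2, "type": "integer"},
    "mlm_probability": {"default": 0.05, "type": "float"},
}
params = merge_with_defaults(data_collator_schema, {"mlm_probability": 0.1})
```

Parameters that are not overridden keep their defaults, which matches the note in pretraining.yaml that the default is used when no value is provided.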

Config YAMLs

pretraining.yaml

# All the parameters needed for the pretraining can be set here. Important: it is the responsibility of the user to provide proper inputs for the model.
# Note: if a parameter is not provided, the default is used.

data_collator:
  # Data collator related parameters
  # ProkBERT applies an overlapping k-mer strategy. Therefore, if one simply masks a single token, it is trivially reconstructable from the neighbouring tokens because of the overlap. To define a proper masking task, the data collator masks tokens to the left and right as well.
  mask_to_left:
    default: 3
    type: "integer"
    description: "The number of tokens to be masked to the left of the original mask tokens to avoid data leakage."
    constraints:
      min: 0
  mask_to_right:
    default: 2
    type: "integer"
    description: "The number of tokens to be masked to the right of the original mask tokens to avoid data leakage."
    constraints:
      min: 0
  mlm_probability:
    default: 0.05
    type: "float"
    description: "The probability of defining a task on a given token. "
    constraints:
      min: 0.0
      max: 1.0  
  replace_prob:
    default: 0.8
    type: "float"
    description: "1 - the probability of restoring a masked token. The others will be changed or restored."
    constraints:
      min: 0.0
      max: 1.0  
  random_prob:
    default: 0.01
    type: "float"
    description: "The probability of replacing a token with a random token. It introduces some random errors to avoid overfitting."
    constraints:
      min: 0.0
      max: 1.0  
model:
  model_name:
    default: 'mini'
    type: "string"
    description: "Name of the pretrained ProkBERT model."
  model_outputpath:
    default: '/scratch/fastscratch/NBL/trained_models/test'
    type: "string"
    description: "Path to the models. If it is not defined, then it will try to load from Hugging Face later."
  vocab_size:
    default: 4101
    type: "integer"
    description: "Size of vocabulary, must align with the tokenizer's vocab."
  hidden_size:
    default: 384
    type: "integer"
    description: "Size of the hidden state in the Transformer."
  num_hidden_layers:
    default: 6
    type: "integer"
    description: "Number of hidden layers in the Transformer."
  num_attention_heads:
    default: 6
    type: "integer"
    description: "Number of attention heads for each Transformer layer."
  max_position_embeddings:
    default: 1024
    type: "integer"
    description: "Maximum number of position embeddings."
  intermediate_size:
    default: 2048
    type: "integer"
    description: "Size of the intermediate (feed-forward) layer in the Transformer."
  position_embedding_type:
    default: 'relative_key_query'
    type: "string"
    description: "Type of position embedding. 'relative_key_query' for relative position embeddings."
  ResumeTraining:
    default: True
    type: "bool"
    description: "Decides whether to continue the pretraining or not."
    constraints:
      options: [True, False]
  resume_or_initiation_model_path:
    default: ''
    type: "string"
    description: "Path to the model to get the initiation parameters and data from. Default is None, in which case the model is initiated randomly."
dataset:
  dataset_path:
    default: ''
    type: "string"
    description: "Path to the dataset if needed. It should not be empty; an empty value triggers an error. Note that the preprocessed dataset should be aligned with the tokenizer to be used."
  pretraining_dataset_data:
    default: [[]]
    type: list
    description: "The raw dataset data. It is recommended to use preprocessed HDF data for the training. "
  dataset_class:
    default: 'IterableProkBERTPretrainingDataset'
    type: "string"
    description: "The class of the dataset to be used. The default is IterableProkBERTPretrainingDataset. It is assumed that the dataset already exists."
    options: ['ProkBERTPretrainingHDFDataset', 'IterableProkBERTPretrainingDataset', 'ProkBERTPretrainingDataset']
  input_batch_size:
    default: 10000
    type: "int"
    description: "Only for iterative HDF, storage based datasets. The size of the batch to be loaded into the memory from the disk. "
  dataset_iteration_batch_offset:
    default: 0
    type: "int"   
    description: "The offset value, i.e. where to start reading the dataset. If the training is restarted, then we should be able to start the iteration at another position."
    constraints:
      min: 0.0
  max_iteration_over_dataset:
    default: 10
    type: "int"   
    description: "Only for iterative datasets. The maximum number of times to iterate over a dataset (a kind of epoch), e.g. 10 times. After that, a stop iteration will be raised."
    constraints:
      min: 0.0
pretraining:
  output_dir:
    default: './train_output'
    type: "string"
    description: "Output directory for training artifacts."
  num_train_epochs:
    default: 1
    type: "float"
    description: "Total number of training epochs."
  save_steps:
    default: 1000
    type: "integer"
    description: "Save model checkpoint every N steps."
  save_total_limit:
    default: 20
    type: "integer"
    description: "Maximum number of total checkpoints to keep."
  logging_steps:
    default: 50
    type: "integer"
    description: "Log metrics every N steps."
  logging_first_step:
    default: True
    type: "boolean"
    description: "Whether to log metrics for the first step."
  per_device_train_batch_size:
    default: 48  # Placeholder; use the appropriate default value
    type: "integer"
    description: "Batch size for training."
  dataloader_num_workers:
    default: 1
    type: "integer"
    description: "Number of subprocesses for data loading."
  learning_rate:
    default: 0.0005
    type: "float"
    description: "Learning rate for training."
  adam_epsilon:
    default: 5e-05
    type: "float"
    description: "Epsilon for the Adam optimizer."
  warmup_steps:
    default: 500
    type: "integer"
    description: "Number of warmup steps for learning rate scheduler."
  weight_decay:
    default: 0.1
    type: "float"
    description: "Weight decay for optimizer."
  adam_beta1:
    default: 0.95
    type: "float"
    description: "Beta1 hyperparameter for the Adam optimizer."
  adam_beta2:
    default: 0.98
    type: "float"
    description: "Beta2 hyperparameter for the Adam optimizer."
  gradient_accumulation_steps:
    default: 1  # Placeholder; use the appropriate default value
    type: "integer"
    description: "Number of steps to accumulate gradients before updating weights."
  optim:
    default: "adamw_torch"
    type: "string"
    description: "Optimizer to use for training."
  ignore_data_skip:
    default: True
    type: "boolean"
    description: "Whether to ignore data skip or not."
    
segmentation:
  type: 'random'
# For the full definition, please see the documentation of sequence_processing.yaml
tokenization:
  kmer: 6
  shift: 1
# For the full definition, please see the documentation of the sequtils parameters
computation:
  numpy_token_integer_prec_byte: 2
finetuning:
  ftmodel:
    default: ""
    type: "string"
    description: "Model name for the finetuning"
  modelclass:
    default: ""
    type: "string"
    description: "Model class to perform the analysis weights."
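
The effect of the `mask_to_left` and `mask_to_right` parameters in the data_collator group above can be illustrated with a small sketch. It is not the data collator's actual code: the function name `extend_mask` is hypothetical, and clipping at the sequence boundaries is an assumption.

```python
def extend_mask(masked_positions, sequence_length, mask_to_left=3, mask_to_right=2):
    """Extend each masked token position to its left and right neighbours."""
    extended = set()
    for position in masked_positions:
        # Clip the window so it stays inside the token sequence (an assumption).
        start = max(0, position - mask_to_left)
        end = min(sequence_length - 1, position + mask_to_right)
        extended.update(range(start, end + 1))
    return sorted(extended)


# Token 5 is selected for masking; its overlapping neighbours are masked as well.
masked = extend_mask([5], sequence_length=20)
```

With the default values, a single selected token results in a run of six masked tokens, so the masked k-mer cannot be read off directly from its overlapping neighbours.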

    

sequence_processing.yaml

segmentation:
  type:
    default: contiguous
    type: "string"
    description: "Defines the segmentation type. 'contiguous' means non-overlapping sections of the sequence are selected end-to-end. In 'random' segmentation, fragments are uniformly sampled from the original sequence."
    constraints:
      options: ["contiguous", "random"]
  min_length:
    default: 0
    type: "integer"
    description: "Sets the minimum length for a segment. Any segment shorter than this will be discarded."
    constraints:
      min: 0
  max_length:
    default: 512
    type: "integer"
    description: "Specifies the maximum length a segment can have."
    constraints:
      min: 0
  coverage:
    default: 1.0
    type: "float"
    description: "Indicates the expected average coverage of any position in the sequence by segments. This is only applicable for type=random. Note that because segments are uniformly sampled, the coverage might vary, especially at the sequence ends."
    constraints:
      min: 0.0
      max: 100.0
tokenization:
  type:
    default: lca
    type: "string"
    description: "Describes the tokenization approach. By default, the LCA (Local Context Aware) method is used."
    constraints:
      options: ["lca"]
  kmer:
    default: 6
    type: "integer"
    description: "Determines the k-mer size for the tokenization process."
    constraints:
      options: [1, 2, 3, 4, 5, 6, 7, 8, 9]
  shift:
    default: 1
    type: "integer"
    description: "Represents the shift parameter in k-mer. The default value is 1."
    constraints:
      min: 0
  max_segment_length:
    default: 2050
    type: "integer"
    description: "Gives the maximum number of characters in a segment. This should be consistent with the language model's capability. It can be alternated with token_limit."
    constraints:
      min: 6
      max: 4294967296
  token_limit:
    default: 4096
    type: "integer"
    description: "States the maximum token count that the language model can process, inclusive of special tokens like CLS and SEP. This is interchangeable with max_segment_length."
    constraints:
      min: 1
      max: 4294967296
  max_unknown_token_proportion:
    default: 0.9999
    type: "float"
    description: "Defines the maximum allowed proportion of unknown tokens in a sequence. For instance, with max_unknown_token_proportion=0.1, a segment in which more than 10% of the tokens are unknown won't be tokenized."
    constraints:
      min: 0
      max: 1
  vocabfile:
    default: auto
    type: "str"
    description: "Path to the vocabulary file. If set to 'auto', the default vocabulary is utilized."
  vocabmap:
    default: {}
    type: "dict"
    description: "The default vocabmap loaded from file"  
  isPaddingToMaxLength:
    default: False
    type: "bool"
    description: "Determines if the tokenized sentence should be padded with [PAD] tokens to produce vectors of a fixed length."
    constraints:
      options: [True, False]
  add_special_token:
    default: True
    type: "bool"
    description: "Whether the tokenizer should add the special sentence start and sentence end tokens. The default is yes."
    constraints:
      options: [True, False]    
computation:
  cpu_cores_for_segmentation:
    default: 10
    type: "integer"
    description: "Specifies the number of CPU cores allocated for the segmentation process."
    constraints:
      min: 1
  cpu_cores_for_tokenization:
    default: -1
    type: "integer"
    description: "Allocates a certain number of CPU cores for the k-mer tokenization process."
    constraints:
      min: 1
  batch_size_tokenization:
    default: 10000
    type: "integer"
    description: "Determines the number of segments a single core processes at a time. The input segment list will be divided into chunks of this size."
    constraints:
      min: 1
  batch_size_fasta_segmentation:
    default: 3
    type: "integer"
    description: "Sets the number of fasta files processed in a single batch, useful when dealing with a large number of fasta files."
    constraints:
      min: 1
  numpy_token_integer_prec_byte:
    default: 2
    type: "integer"
    description: "The type of integer to be used during the vectorization. The default is 2; if you want to work with larger k-mers, increase it to 4. 1: np.int8, 2: np.int16, 4: np.int32, 8: np.int64"
    constraints:
      options: [1, 2, 4, 8]
  np_tokentype:
    default: np.int64
    type: "type"
    description: "Dummy"