Datasets

IterableProkBERTPretrainingDataset

prokbert.prok_datasets.IterableProkBERTPretrainingDataset(...)

class prokbert.prok_datasets.IterableProkBERTPretrainingDataset(file_path: str, input_batch_size: int = 10000, ds_offset: int = 0, max_iteration_over_ds: int = 10, default_dtype=torch.int64, add_end_token=False)

Bases: IterableDataset

__getitem__(index) → Union[Tensor, List[Tensor]]

Get item or slice from the dataset.

Parameters

index – Index or slice object

Returns

Dataset item or slice

__init__(file_path: str, input_batch_size: int = 10000, ds_offset: int = 0, max_iteration_over_ds: int = 10, default_dtype=torch.int64, add_end_token=False)

Initialize the IterableProkBERTPretrainingDataset.

Parameters
  • file_path – Path to the HDF5 file.

  • input_batch_size – Batch size for data fetching.

  • ds_offset – Offset for data fetching.

  • max_iteration_over_ds – Maximum number of iterations over the dataset.

Example:
>>> from prokbert.prok_datasets import IterableProkBERTPretrainingDataset
>>> dataset = IterableProkBERTPretrainingDataset(file_path="path/to/file.hdf5")
>>> for data in dataset:
...     pass  # process data
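
The parameter descriptions above do not spell out how input_batch_size, ds_offset and max_iteration_over_ds interact. The following is a minimal sketch of one plausible reading, not the library's implementation: data is read in chunks of input_batch_size starting at ds_offset, and reading wraps around to the beginning until max_iteration_over_ds passes have been made. The helper name iter_in_batches and the use of a plain Python list in place of an HDF5 dataset are assumptions made for illustration only.

```python
from typing import Iterator, List, Sequence


def iter_in_batches(
    data: Sequence[int],
    input_batch_size: int = 4,
    ds_offset: int = 0,
    max_iteration_over_ds: int = 2,
) -> Iterator[List[int]]:
    """Yield consecutive chunks of `data`, starting at `ds_offset`.

    Hypothetical helper (not part of prokbert): once the end of `data`
    is reached, reading restarts at index 0, until
    `max_iteration_over_ds` passes have been made.
    """
    if input_batch_size <= 0:
        raise ValueError("input_batch_size must be positive")
    pointer = ds_offset  # assumed meaning: start position of the first pass
    for _ in range(max_iteration_over_ds):
        while pointer < len(data):
            yield list(data[pointer:pointer + input_batch_size])
            pointer += input_batch_size
        pointer = 0  # later passes start from the beginning


# A plain list stands in for the records stored in the HDF5 file.
batches = list(
    iter_in_batches(
        list(range(10)),
        input_batch_size=4,
        ds_offset=6,
        max_iteration_over_ds=2,
    )
)
print(batches)
```

Under these assumptions, the first pass covers only the records from ds_offset onward, and each later pass covers the whole dataset. The last chunk of a pass may be shorter than input_batch_size.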