Datasets
IterableProkBERTPretrainingDataset
- class prokbert.prok_datasets.IterableProkBERTPretrainingDataset(file_path: str, input_batch_size: int = 10000, ds_offset: int = 0, max_iteration_over_ds: int = 10, default_dtype=torch.int64, add_end_token=False)
Bases: IterableDataset
- __getitem__(index) → Union[Tensor, List[Tensor]]
Get item or slice from the dataset.
- Parameters
index – Index or slice object
- Returns
Dataset item or slice
- __init__(file_path: str, input_batch_size: int = 10000, ds_offset: int = 0, max_iteration_over_ds: int = 10, default_dtype=torch.int64, add_end_token=False)
Initialize the IterableProkBERTPretrainingDataset.
- Parameters
file_path – Path to the HDF5 file.
input_batch_size – Batch size for data fetching.
ds_offset – Offset for data fetching.
max_iteration_over_ds – Maximum number of iterations over the dataset.
- Example:
>>> dataset = IterableProkBERTPretrainingDataset(file_path="path/to/file.hdf5")
>>> for data in dataset:
...     pass  # process data
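The parameter descriptions above do not spell out how input_batch_size, ds_offset and max_iteration_over_ds interact. The following is a minimal, standard-library sketch of one plausible reading, not ProkBERT's actual implementation: rows are fetched in slices of input_batch_size, starting at ds_offset, and the whole source is traversed max_iteration_over_ds times. The function name iter_in_batches, the in-memory rows list standing in for the HDF5 file, and the choice to apply the offset on every pass are all assumptions made for illustration.

```python
from typing import Iterator, List, Sequence


def iter_in_batches(
    rows: Sequence[List[int]],
    input_batch_size: int = 10000,
    ds_offset: int = 0,
    max_iteration_over_ds: int = 10,
) -> Iterator[List[int]]:
    """Yield rows one at a time, fetching them in slices of input_batch_size.

    Hypothetical helper: `rows` stands in for the HDF5-backed storage, and
    the offset is applied on every pass (an assumption, not documented).
    """
    if input_batch_size <= 0:
        raise ValueError("input_batch_size must be positive")
    # Repeat the full traversal a fixed number of times.
    for _ in range(max_iteration_over_ds):
        # Walk the source in fixed-size slices, skipping the first ds_offset rows.
        for start in range(ds_offset, len(rows), input_batch_size):
            batch = rows[start:start + input_batch_size]
            # Hand out the fetched slice row by row.
            yield from batch


# Usage: five toy rows, slices of two, skipping the first row, two passes.
toy_rows = [[i] for i in range(5)]
result = list(
    iter_in_batches(toy_rows, input_batch_size=2, ds_offset=1, max_iteration_over_ds=2)
)
```

Fetching in slices keeps at most input_batch_size rows in memory at a time, which is the usual reason for pairing an iterable dataset with on-disk storage such as HDF5.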