prokbert.sequtils.get_rectangular_array_from_tokenized_dataset

prokbert.sequtils.get_rectangular_array_from_tokenized_dataset(tokenized_segments_data: Dict[int, List[numpy.ndarray]], shift: int, max_token_count: int, truncate_zeros: bool = True, randomize: bool = True, numpy_dtype: Type = numpy.uint16) → Tuple[ndarray, DataFrame]

Create a rectangular numpy array from tokenized segment data, suitable as input to a language model (LM).

Parameters

tokenized_segments_data (Dict[int, List[np.ndarray]]) – A dictionary where keys are segment ids and values are lists of possible LCA-tokenized vectors.
shift (int) – Number of LCA offsets.
max_token_count (int) – Maximum allowed token count in the output numpy array.
truncate_zeros (bool, optional) – If True, truncate columns from the end of the numpy array that only contain zeros. (default=True)
randomize (bool, optional) – If True, randomize the order of the rows in the output numpy array. (default=True)
numpy_dtype (Type, optional) – Data type of the values in the output numpy array. (default=np.uint16)
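
The parameters above can be illustrated with a small stand-alone sketch. This is not prokbert's implementation: pack_rectangular is a hypothetical helper, and it assumes that 0 is the padding value, that vectors longer than max_token_count are cut off, and that each segment's list holds one vector per offset. The seed argument is also an addition, used only to make the shuffle reproducible.

```python
import numpy as np


def pack_rectangular(tokenized, max_token_count, truncate_zeros=True,
                     randomize=True, dtype=np.uint16, seed=0):
    """Hypothetical helper: pad token vectors with zeros into one 2-D array."""
    # Flatten {segment_id: [vector, ...]} into (segment_id, offset, vector) rows.
    # Assumption: the position of a vector in its list is its offset.
    rows = [(seg_id, offset, vec)
            for seg_id, vectors in tokenized.items()
            for offset, vec in enumerate(vectors)]

    if randomize:
        # Shuffle the row order; the seed only makes the sketch reproducible.
        order = np.random.default_rng(seed).permutation(len(rows))
        rows = [rows[i] for i in order]

    # Zero-filled array: one row per vector, max_token_count columns.
    array = np.zeros((len(rows), max_token_count), dtype=dtype)
    for i, (_, _, vec) in enumerate(rows):
        n = min(len(vec), max_token_count)  # assumption: cut off overlong vectors
        array[i, :n] = vec[:n]

    if truncate_zeros and array.size:
        # Drop trailing columns that contain only zeros.
        nonzero_cols = np.flatnonzero(array.any(axis=0))
        last = nonzero_cols[-1] + 1 if nonzero_cols.size else 0
        array = array[:, :last]

    # (row index, segment id, offset) for every row of the array.
    index = [(i, seg_id, offset) for i, (seg_id, offset, _) in enumerate(rows)]
    return array, index


# Made-up token ids, purely for illustration.
example = {0: [np.array([2, 5, 6, 3]), np.array([2, 7, 3])],
           1: [np.array([2, 8, 9, 4, 3])]}
packed, row_index = pack_rectangular(example, max_token_count=8, randomize=False)
```

With truncate_zeros=True the sketch returns only as many columns as the longest vector needs; with truncate_zeros=False it keeps all max_token_count columns.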

Returns

A tuple of two elements:

A rectangular numpy array (np.ndarray) suitable for input to an LM.
A dataframe (pd.DataFrame) that describes which row in the numpy array corresponds to which segment and its LCA offset. Columns are: ['torch_id', 'segment_id', 'offset']

Return type

Tuple[np.ndarray, pd.DataFrame]
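
Because rows may be shuffled, the dataframe is what ties a row of the array back to its segment. The sketch below shows one way to look up a segment's rows. It uses made-up stand-in values rather than real function output, rows_for_segment is a hypothetical helper, and it assumes that torch_id is the zero-based row index into the array.

```python
import numpy as np
import pandas as pd

# Stand-in values shaped like the documented return pair (made-up data).
X = np.array([[2, 7, 3, 0],
              [2, 5, 6, 3],
              [2, 8, 9, 3]], dtype=np.uint16)
torch_db = pd.DataFrame({"torch_id": [0, 1, 2],
                         "segment_id": [1, 0, 0],
                         "offset": [0, 0, 1]})


def rows_for_segment(array, mapping, segment_id):
    """Hypothetical helper: rows of `array` for one segment, ordered by offset."""
    hits = mapping[mapping["segment_id"] == segment_id].sort_values("offset")
    # Assumption: torch_id is the row index into the array.
    return array[hits["torch_id"].to_numpy()]


segment_rows = rows_for_segment(X, torch_db, segment_id=0)
```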