prokbert.sequtils.get_rectangular_array_from_tokenized_dataset

prokbert.sequtils.get_rectangular_array_from_tokenized_dataset(tokenized_segments_data: Dict[int, List[numpy.ndarray]], shift: int, max_token_count: int, truncate_zeros: bool = True, randomize: bool = True, numpy_dtype: Type = numpy.uint16) → Tuple[ndarray, DataFrame]

Create a rectangular numpy array that can be used as input to a Language Model (LM) from tokenized segment data.

Parameters
  • tokenized_segments_data (Dict[int, List[np.ndarray]]) – A dictionary where keys are segment ids and values are lists of possible LCA tokenized vectors.

  • shift (int) – Number of LCA offsets.

  • max_token_count (int) – Maximum allowed token count in the output numpy array.

  • truncate_zeros (bool, optional) – If True, truncate columns from the end of the numpy array that only contain zeros. (default=True)

  • randomize (bool, optional) – If True, randomize the order of the rows in the output numpy array. (default=True)

  • numpy_dtype (Type, optional) – Data type of the values in the output numpy array. (default=np.uint16)

Returns

A tuple of two items:

  • A rectangular numpy array suitable for input to an LM.

  • A dataframe that describes which row in the numpy array corresponds to which segment and its LCA offset. Columns are: ['torch_id', 'segment_id', 'offset'].

Return type

Tuple[np.ndarray, pd.DataFrame]
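
To make the described output layout concrete, the following is a minimal sketch that mimics the documented behavior using NumPy only. It is not the prokbert implementation: the helper name `pack_segments_sketch` is hypothetical, the token ids are made up, and it assumes that 0 is the padding value, that each segment contributes one row per offset, and that vectors longer than `max_token_count` are clipped. Row shuffling (`randomize`) is left out so the result is deterministic, and the row-to-segment mapping is returned as a plain list of tuples rather than a DataFrame.

```python
import numpy as np


def pack_segments_sketch(tokenized, shift, max_token_count,
                         truncate_zeros=True, dtype=np.uint16):
    """Hypothetical stand-in that illustrates the documented output layout."""
    n_rows = len(tokenized) * shift
    # Start from an all-zero array; zero is assumed to be the padding value.
    X = np.zeros((n_rows, max_token_count), dtype=dtype)
    mapping = []  # (torch_id, segment_id, offset), one entry per row
    row = 0
    for segment_id, vectors in tokenized.items():
        for offset in range(shift):
            # Assumption: overly long vectors are clipped to max_token_count.
            vec = vectors[offset][:max_token_count]
            X[row, :len(vec)] = vec
            mapping.append((row, segment_id, offset))
            row += 1
    if truncate_zeros:
        # Drop trailing columns that contain only zeros.
        nonzero_cols = np.flatnonzero(X.any(axis=0))
        last = nonzero_cols[-1] + 1 if nonzero_cols.size else 0
        X = X[:, :last]
    return X, mapping


# Two segments, each tokenized at two offsets (made-up token ids).
tokenized = {
    0: [np.array([2, 7, 9, 3]), np.array([2, 8, 3])],
    1: [np.array([2, 5, 3]), np.array([2, 6, 4, 3])],
}
X, mapping = pack_segments_sketch(tokenized, shift=2, max_token_count=8)
print(X.shape)
print(mapping)
```

In this sketch the array is allocated with 8 columns, but the longest vector has 4 tokens, so truncating the all-zero trailing columns leaves 4 columns. Each entry of `mapping` plays the role of one row of the returned dataframe, with the fields in the order torch_id, segment_id, offset.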