
base_source

Module containing the BaseSource class.

BaseSource is the abstract data source class from which all concrete data sources must inherit.

Classes

BaseSource

class BaseSource(*args: Any, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None, **kwargs: Any):

Abstract Base Source from which all other data sources must inherit.

Arguments

  • data_splitter: Approach used for splitting the data into training, validation and test sets. Defaults to None.
  • seed: Random number seed. Used for setting the random seed for all libraries. Defaults to None.
  • modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
  • ignore_cols: Column or list of columns to be ignored from the data. Defaults to None.

Attributes

  • data: A DataFrame-type object which contains the data.
  • data_splitter: Approach used for splitting the data into training, validation and test sets.
  • seed: Random number seed. Used for setting the random seed for all libraries.
  • train_idxs: A numpy array containing the indices of the data which will be used for training.
  • validation_idxs: A numpy array containing the indices of the data which will be used for validation.
  • test_idxs: A numpy array containing the indices of the data which will be used for testing.

Static methods


def __init_subclass__(**kwargs: Any) -> Callable[[~T, Any, Optional[DatasetSplitter], Optional[int], Optional[Dict[str, DataPathModifiers]], Union[str, Sequence[str], None], Any], None]:

Decorate the subclass __init__ to call the super class __init__.

Force all data sources to call the super class __init__ so that the required attributes are always set.
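The pattern is roughly: the base class hooks subclass creation and wraps each subclass __init__ so that the base initialiser is guaranteed to run. A minimal, self-contained sketch of that idea (not the library's actual implementation):

```python
import functools
from typing import Any, Optional


class Base:
    """Stand-in for BaseSource: sets the attributes every subclass needs."""

    def __init__(self, *args: Any, seed: Optional[int] = None, **kwargs: Any) -> None:
        self.seed = seed
        self._base_initialised = True

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        subclass_init = cls.__init__

        @functools.wraps(subclass_init)
        def wrapped_init(self: "Base", *args: Any, **kw: Any) -> None:
            subclass_init(self, *args, **kw)
            # If the subclass forgot to call super().__init__, call it now so
            # the required attributes are always set.
            if not getattr(self, "_base_initialised", False):
                Base.__init__(self, *args, **kw)

        cls.__init__ = wrapped_init


class MySource(Base):
    def __init__(self, path: str) -> None:
        self.path = path  # note: no explicit super().__init__() call


src = MySource("data.csv")
assert src.seed is None and src._base_initialised  # base init still ran
```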

Methods


def get_column(self, col_name: str, **kwargs: Any) -> Union[numpy.ndarray, pandas.core.series.Series]:

Implement this method to get a single column from the dataset.

def get_data(self, **kwargs: Any) -> Optional[pandas.core.frame.DataFrame]:

Implement this method to load and return the dataset.

def get_dtypes(self, **kwargs: Any) -> Dict[str, Union[ExtensionDtype, str, numpy.dtype, Type[Union[str, float, int, complex, bool, object]]]]:

Implement this method to get the columns and column types from the dataset.

def get_values(self, col_names: List[str], **kwargs: Any) -> Dict[str, Iterable[Any]]:

Implement this method to get the distinct values from a list of columns.
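To make the abstract interface concrete, here is a minimal sketch of a single-table source backed by an in-memory DataFrame. The class name and sample data are illustrative, the import path is a placeholder for wherever this module actually lives, and only the methods documented above (plus the multi_table property listed under Variables) are shown:

```python
from typing import Any, Dict, Iterable, List, Optional, Union

import numpy as np
import pandas as pd

from your_package.base_source import BaseSource  # adjust to the real import path


class InMemorySource(BaseSource):
    """Illustrative concrete source that serves a DataFrame supplied at construction."""

    def __init__(self, df: pd.DataFrame, **kwargs: Any) -> None:
        super().__init__(**kwargs)  # required: sets data_splitter, seed, etc.
        self._df = df

    def get_data(self, **kwargs: Any) -> Optional[pd.DataFrame]:
        return self._df

    def get_dtypes(self, **kwargs: Any) -> Dict[str, Any]:
        return self._df.dtypes.to_dict()

    def get_column(self, col_name: str, **kwargs: Any) -> Union[np.ndarray, pd.Series]:
        return self._df[col_name]

    def get_values(self, col_names: List[str], **kwargs: Any) -> Dict[str, Iterable[Any]]:
        return {col: self._df[col].unique() for col in col_names}

    @property
    def multi_table(self) -> bool:
        return False
```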

def load_data(self, **kwargs: Any) -> None:

Load the data for the datasource.

We wrap get_data with lru_cache so that this method is idempotent: it can be called multiple times with the same arguments without reloading the data. (A usage sketch follows the Raises section below.)

Raises

  • TypeError: If data format is not supported.
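Continuing the illustrative InMemorySource sketch above (and assuming the data attribute is populated by load_data, as described under Attributes), the method can be called repeatedly without the underlying data being re-read:

```python
import pandas as pd

source = InMemorySource(
    pd.DataFrame({"age": [34, 51, 29], "label": ["a", "b", "a"]}),
    seed=42,                # forwarded to BaseSource.__init__
    ignore_cols=["label"],  # per the Arguments above, ignored from the data
)
source.load_data()
source.load_data()  # idempotent: the lru_cache-wrapped get_data is not re-run
print(source.data.head())
```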

Variables

  • data : [pandas.core.frame.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) - Data.
  • hash : str - The hash associated with this BaseSource.

    This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

    Returns: The hexdigest of the DataFrame hash. (A sketch of how such a schema-level hash can be derived follows this list.)

  • multi_table : bool - Implement this method to define whether the data source is multi-table.
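As noted for the hash variable above, the hash covers only the static format of the underlying DataFrame. A minimal sketch of that idea (not the library's actual hashing routine), derived from column names and dtypes only so that it stays stable when rows are added:

```python
import hashlib

import pandas as pd


def schema_hash(df: pd.DataFrame) -> str:
    """Hash column names and dtypes, ignoring the row contents themselves."""
    schema = ";".join(f"{name}:{dtype}" for name, dtype in df.dtypes.items())
    return hashlib.sha256(schema.encode("utf-8")).hexdigest()


df_small = pd.DataFrame({"a": [1], "b": ["x"]})
df_large = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
assert schema_hash(df_small) == schema_hash(df_large)  # same format, same hash
```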

MultiTableSource

class MultiTableSource(*args: Any, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None, **kwargs: Any):

Abstract base source that supports multiple tables.

Methods


def get_data(self, table_name: Optional[str] = None, **kwargs: Any) -> Optional[pandas.core.frame.DataFrame]:

Implement this method to load and return the dataset.

Variables

  • table_names : List[str] - Implement this property to return the names of the tables in the data source.
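A minimal sketch of a concrete multi-table source holding several DataFrames keyed by table name. The class name and import path are placeholders, and the remaining abstract methods inherited from BaseSource are omitted for brevity:

```python
from typing import Any, Dict, List, Optional

import pandas as pd

from your_package.base_source import MultiTableSource  # adjust to the real import path


class DictOfTablesSource(MultiTableSource):
    """Illustrative multi-table source backed by a dict of DataFrames."""

    def __init__(self, tables: Dict[str, pd.DataFrame], **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self._tables = tables

    def get_data(
        self, table_name: Optional[str] = None, **kwargs: Any
    ) -> Optional[pd.DataFrame]:
        if table_name is None:
            return None  # a multi-table source needs a table name to return data
        return self._tables[table_name]

    @property
    def table_names(self) -> List[str]:
        return list(self._tables.keys())

    @property
    def multi_table(self) -> bool:
        return True

    # get_dtypes, get_column and get_values from BaseSource would also need
    # concrete implementations; they are omitted here for brevity.
```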