base_source
Module containing BaseSource class.
BaseSource is the abstract data source class from which all concrete data sources must inherit.
Classes
BaseSource
class BaseSource(*args: Any, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None, **kwargs: Any):
Abstract Base Source from which all other data sources must inherit.
Arguments
data_splitter
: Approach used for splitting the data into training, test, and validation sets. Defaults to None.
seed
: Random number seed. Used for setting the random seed for all libraries. Defaults to None.
modifiers
: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
ignore_cols
: Column or list of columns to be ignored from the data. Defaults to None.
Attributes
data
: A dataframe-type object which contains the data.
data_splitter
: Approach used for splitting the data into training, test, and validation sets.
seed
: Random number seed. Used for setting the random seed for all libraries.
train_idxs
: A numpy array containing the indices of the data which will be used for training.
validation_idxs
: A numpy array containing the indices of the data which will be used for validation.
test_idxs
: A numpy array containing the indices of the data which will be used for testing.
Static methods
def __init_subclass__(...) ‑> Callable[[~T, Any, Optional[DatasetSplitter], Optional[int], Optional[Dict[str, DataPathModifiers]], Union[str, Sequence[str], None], Any], None]:
Decorate the subclass `__init__` to call the superclass `__init__`.
Forces all data sources to call the superclass `__init__` so that the required attributes are set.
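A minimal sketch of the enforcement pattern this describes, using a simplified stand-in class (`_Base`) rather than the real `BaseSource`; the library's actual implementation may differ in detail.

```python
import functools
from typing import Any, Optional


class _Base:
    """Stand-in for BaseSource; only the wiring pattern is illustrated."""

    def __init__(self, *args: Any, seed: Optional[int] = None, **kwargs: Any) -> None:
        # Required attribute that every subclass instance must end up with.
        self.seed = seed

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        subclass_init = cls.__init__

        @functools.wraps(subclass_init)
        def wrapped_init(self: Any, *args: Any, **init_kwargs: Any) -> None:
            # Always run the base initialiser first, even if the subclass
            # author forgot to call super().__init__().
            _Base.__init__(self, *args, **init_kwargs)
            subclass_init(self, *args, **init_kwargs)

        cls.__init__ = wrapped_init  # replace the subclass __init__ with the wrapper


class MySource(_Base):
    def __init__(self, path: str, **kwargs: Any) -> None:
        self.path = path  # no explicit super().__init__() call here


src = MySource("data.csv", seed=42)
print(src.seed, src.path)  # 42 data.csv -- base attributes were still set
```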
Methods
def get_column( self, col_name: str, **kwargs: Any,) ‑> Union[numpy.ndarray, pandas.core.series.Series]:
Implement this method to get a single column from the dataset.
def get_data(self, **kwargs: Any) ‑> Optional[pandas.core.frame.DataFrame]:
Implement this method to load and return the dataset.
def get_dtypes( self, **kwargs: Any,) ‑> Dict[str, Union[ExtensionDtype, str, numpy.dtype, Type[Union[str, float, int, complex, bool, object]]]]:
Implement this method to get the columns and column types from the dataset.
def get_values(self, col_names: List[str], **kwargs: Any) ‑> Dict[str, Iterable[Any]]:
Implement this method to get the distinct values from a list of columns.
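For illustration, a minimal in-memory implementation of the four abstract methods above might look like the following sketch. The import path and the exact abstract signatures are assumptions based on this page, and the multi_table property is included because the variables section below lists it; adapt to the installed library version.

```python
from typing import Any, Dict, Iterable, List, Optional, Union

import numpy as np
import pandas as pd

from base_source import BaseSource  # import path is an assumption; adjust for your install


class DataFrameSource(BaseSource):
    """Hypothetical concrete source wrapping an in-memory pandas DataFrame."""

    def __init__(self, df: pd.DataFrame, **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self._df = df

    def get_data(self, **kwargs: Any) -> Optional[pd.DataFrame]:
        # The data is already in memory, so loading is trivial.
        return self._df

    def get_column(self, col_name: str, **kwargs: Any) -> Union[np.ndarray, pd.Series]:
        return self._df[col_name]

    def get_dtypes(self, **kwargs: Any) -> Dict[str, Any]:
        return self._df.dtypes.to_dict()

    def get_values(self, col_names: List[str], **kwargs: Any) -> Dict[str, Iterable[Any]]:
        return {col: self._df[col].unique() for col in col_names}

    @property
    def multi_table(self) -> bool:
        return False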
def load_data(self, **kwargs: Any) ‑> None:
Load the data for the datasource.
We wrap get_data with lru_cache, so this method is idempotent: it can be called multiple times with the same arguments without reloading the data.
Raises
TypeError
: If data format is not supported.
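A usage sketch, assuming the hypothetical DataFrameSource from the previous example; it shows the constructor keyword arguments being forwarded to BaseSource and the idempotent load_data call populating the data attribute.

```python
import pandas as pd

# Hypothetical DataFrameSource from the sketch above.
source = DataFrameSource(
    pd.DataFrame({"age": [34, 51, 29], "label": [0, 1, 0]}),
    seed=42,                 # forwarded to BaseSource
    ignore_cols=["label"],   # forwarded to BaseSource
)

source.load_data()  # loads via get_data()
source.load_data()  # safe to repeat; the lru_cache-wrapped get_data is not re-run

print(source.data.head())
```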
Variables
data : [pandas.core.frame.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame)
- Data.
hash : str
- The hash associated with this BaseSource. This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types, but NOT anything relating to the content itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format (see the illustrative sketch after this variables list).
Returns: The hexdigest of the DataFrame hash.
multi_table : bool
- Implement this property to define whether the data source is multi-table.
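The following is an illustrative sketch only of how a schema-level hash with the properties described above could be computed from column names and dtypes while ignoring row contents; the library's actual hashing scheme is not documented here and may differ.

```python
import hashlib

import pandas as pd


def schema_hash(df: pd.DataFrame) -> str:
    # Hash only the static information: column names and their dtypes.
    static_info = sorted((name, str(dtype)) for name, dtype in df.dtypes.items())
    return hashlib.sha256(repr(static_info).encode("utf-8")).hexdigest()


df_a = pd.DataFrame({"age": [34, 51], "label": [0, 1]})
df_b = pd.DataFrame({"age": [12, 90, 45], "label": [1, 1, 0]})  # more rows, same schema
assert schema_hash(df_a) == schema_hash(df_b)  # hash is stable as data grows
```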
MultiTableSource
class MultiTableSource(*args: Any, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None, **kwargs: Any):
Abstract base source that supports multiple tables.
Ancestors
BaseSource
Methods
def get_data( self, table_name: Optional[str] = None, **kwargs: Any,) ‑> Optional[pandas.core.frame.DataFrame]:
Implement this method to load and return the dataset.
Variables
table_names : List[str]
- Implement this property to return the names of the tables in the data source.
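A minimal sketch of a concrete multi-table source backed by a dictionary of DataFrames; the import path is an assumption, and the remaining abstract methods inherited from BaseSource (get_column, get_dtypes, get_values) are omitted for brevity.

```python
from typing import Any, Dict, List, Optional

import pandas as pd

from base_source import MultiTableSource  # import path is an assumption


class DictOfFramesSource(MultiTableSource):
    """Hypothetical multi-table source over named in-memory DataFrames."""

    def __init__(self, tables: Dict[str, pd.DataFrame], **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self._tables = tables

    def get_data(
        self, table_name: Optional[str] = None, **kwargs: Any
    ) -> Optional[pd.DataFrame]:
        if table_name is None:
            # A multi-table source needs a table name to return a single frame.
            return None
        return self._tables[table_name]

    @property
    def table_names(self) -> List[str]:
        return list(self._tables.keys())

    @property
    def multi_table(self) -> bool:
        return True
```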