datasplitters

Classes for splitting data.

Classes

DatasetSplitter

class DatasetSplitter():

Parent class for different types of dataset splits.

Ancestors

abc.ABC

Subclasses

PercentageSplitter
SplitterDefinedInData

Static methods

def create(splitter_name: str, **kwargs: Any) ‑> DatasetSplitter:

Create a DatasetSplitter of the requested type.

def splitter_name() ‑> str:

Returns the string name for the splitter type.
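As a rough illustration of how a name-based factory such as create can work alongside splitter_name, here is a minimal sketch. The classes SketchSplitter and SketchPercentageSplitter and the name "percentage" are hypothetical stand-ins invented for this example; the library's real registry mechanism and splitter names are not documented here and may differ.

```python
from abc import ABC, abstractmethod
from typing import Any


class SketchSplitter(ABC):
    """Hypothetical stand-in for DatasetSplitter; not the library's code."""

    @staticmethod
    @abstractmethod
    def splitter_name() -> str:
        """Return the string name that identifies this splitter type."""

    @staticmethod
    def create(splitter_name: str, **kwargs: Any) -> "SketchSplitter":
        # Build a name -> class lookup from the direct subclasses.
        registry = {
            sub.splitter_name(): sub for sub in SketchSplitter.__subclasses__()
        }
        if splitter_name not in registry:
            raise ValueError(f"Unknown splitter: {splitter_name!r}")
        # Forward the keyword arguments to the chosen constructor.
        return registry[splitter_name](**kwargs)


class SketchPercentageSplitter(SketchSplitter):
    def __init__(self, validation_percentage: int = 10, test_percentage: int = 10):
        self.validation_percentage = validation_percentage
        self.test_percentage = test_percentage

    @staticmethod
    def splitter_name() -> str:
        return "percentage"  # assumed name, for illustration only


splitter = SketchSplitter.create("percentage", test_percentage=20)
print(type(splitter).__name__, splitter.test_percentage)
# prints: SketchPercentageSplitter 20
```

Keying the lookup on splitter_name keeps the factory unaware of concrete subclasses, so a new splitter type only needs to subclass the parent and report its own name.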

Methods

def create_dataset_splits(self, data: pd.DataFrame) ‑> Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]:

Returns the indices of the training, validation, and test sets.

def get_split_query(self, datasource: DatabaseSource, split: DataSplit) ‑> str:

Modifies the datasource's SQL query to return a single split of the data.

Arguments

datasource: A DatabaseSource object.
split: The relevant split to return from the query.

Returns: The modified SQL query.

PercentageSplitter

class PercentageSplitter(validation_percentage: int = 10, test_percentage: int = 10, time_series_sort_by: Optional[Union[List[str], str]] = None):

Splits data into sets based on percentages.

By default, 80% of the data is used for training, 10% for validation, and 10% for testing.

Arguments

validation_percentage: The percentage of data to be used for validation. Defaults to 10.
test_percentage: The percentage of data to be used for testing. Defaults to 10.
time_series_sort_by: A string or list of strings naming the dataset features to sort by for time series data. The dataframe is sorted by the values of those features so that the validation and test sets come after the training set, which removes potential bias during training and evaluation. Defaults to None.
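The arithmetic behind these arguments can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name percentage_split_indices and the seed parameter are hypothetical, and shuffling the rows when time_series_sort_by is None is an assumption of this sketch.

```python
from typing import List, Optional, Tuple, Union

import numpy as np
import pandas as pd


def percentage_split_indices(
    data: pd.DataFrame,
    validation_percentage: int = 10,
    test_percentage: int = 10,
    time_series_sort_by: Optional[Union[List[str], str]] = None,
    seed: int = 0,  # hypothetical parameter, only for reproducibility here
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Return positional (train, validation, test) indices. Illustrative only."""
    n_rows = len(data)
    if time_series_sort_by is not None:
        # Sort chronologically so the latest rows end up in validation/test.
        order = (
            data.reset_index(drop=True)
            .sort_values(by=time_series_sort_by, kind="mergesort")
            .index.to_numpy()
        )
    else:
        # Assumption of this sketch: shuffle rows when no sort key is given.
        order = np.random.default_rng(seed).permutation(n_rows)

    n_validation = n_rows * validation_percentage // 100
    n_test = n_rows * test_percentage // 100
    n_train = n_rows - n_validation - n_test  # the remainder is training data

    train = order[:n_train]
    validation = order[n_train:n_train + n_validation]
    test = order[n_train + n_validation:]
    return train, validation, test


frame = pd.DataFrame(
    {"day": [5, 1, 4, 2, 3, 9, 8, 7, 6, 10], "value": range(10)}
)
train, validation, test = percentage_split_indices(
    frame, time_series_sort_by="day"
)
print(validation.tolist(), test.tolist())
# prints: [5] [9]
```

With the defaults and ten rows, eight rows go to training and one each to validation and testing; sorting by day places the rows for days 9 and 10 in the validation and test sets respectively.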

Ancestors

DatasetSplitter
abc.ABC

Static methods

def splitter_name() ‑> str:

Class method for splitter name.

Returns: The string name for the splitter type.

Methods

def create_dataset_splits(self, data: pd.DataFrame) ‑> Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]:

Create splits in the dataset for the training, validation, and test sets.

Arguments

data: The dataframe to be split.

Returns: A tuple of arrays, each containing the indices from the data to be used for training, validation, and testing, respectively.

def get_split_query(self, datasource: DatabaseSource, split: DataSplit) ‑> str:

Modifies the datasource's SQL query to return a single split of the data.

Arguments

datasource: A DatabaseSource object.
split: The relevant split to return from the query.

Returns: The modified SQL query.

caution

This method will only work for databases that support the LIMIT ... OFFSET syntax. Notably, Microsoft SQL Server does not support this syntax.

caution

It is strongly recommended that you sort the data as part of the SQL query to ensure that the splits are random, because for iterable datasets the splits are simply taken in order from TRAIN to TEST.

Similarly, time_series_sort_by is ignored, and a warning is logged, if it is set. To sort by time series, do so as part of the SQL query.
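To make the LIMIT ... OFFSET behaviour described above concrete, here is a minimal sketch that appends the clause to a query and runs it against an in-memory SQLite database. The function limit_offset_query, the plain-string split names, and the readings table are all hypothetical; the sketch also assumes the base query has no LIMIT clause of its own and that the validation split sits between TRAIN and TEST.

```python
import sqlite3


def limit_offset_query(
    base_query: str,
    n_rows: int,
    split: str,
    validation_percentage: int = 10,
    test_percentage: int = 10,
) -> str:
    """Append LIMIT/OFFSET so the query returns one split (sketch only)."""
    n_validation = n_rows * validation_percentage // 100
    n_test = n_rows * test_percentage // 100
    n_train = n_rows - n_validation - n_test
    # Splits are taken in order: TRAIN first, then VALIDATE, then TEST.
    bounds = {
        "TRAIN": (n_train, 0),
        "VALIDATE": (n_validation, n_train),
        "TEST": (n_test, n_train + n_validation),
    }
    limit, offset = bounds[split]
    # Drop any trailing semicolon before appending the clause.
    stripped = base_query.rstrip().rstrip(";")
    return f"{stripped} LIMIT {limit} OFFSET {offset}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO readings (id, value) VALUES (?, ?)",
    [(i, i * 0.5) for i in range(1, 21)],
)

# The ORDER BY lives in the query itself, as the caution above recommends.
base = "SELECT id, value FROM readings ORDER BY id"
n_rows = conn.execute(f"SELECT COUNT(*) FROM ({base}) AS q").fetchone()[0]

train_rows = conn.execute(limit_offset_query(base, n_rows, "TRAIN")).fetchall()
test_rows = conn.execute(limit_offset_query(base, n_rows, "TEST")).fetchall()
print(len(train_rows), [row[0] for row in test_rows])
# prints: 16 [19, 20]
```

Because the rows are sliced strictly by position, whatever ordering the query imposes decides which rows land in each split, which is why the ordering belongs in the SQL itself.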

Variables

static test_percentage : int

static time_series_sort_by : Union[List[str], str, None]

static validation_percentage : int

Inherited members

DatasetSplitter:
- DatasetSplitter.create

SplitterDefinedInData

class SplitterDefinedInData(column_name: str = 'BITFOUNT_SPLIT_CATEGORY', training_set_label: str = 'TRAIN', validation_set_label: str = 'VALIDATE', test_set_label: str = 'TEST'):

Splits data into sets based on a value in each row.

The splitting is done based on the values in a user-specified column.

Arguments

column_name: The name of the column that contains the labels for splitting. Defaults to "BITFOUNT_SPLIT_CATEGORY".
training_set_label: The label for the data points to be included in the training set. Defaults to "TRAIN".
validation_set_label: The label for the data points to be included in the validation set. Defaults to "VALIDATE".
test_set_label: The label for the data points to be included in the test set. Defaults to "TEST".
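A label-based split of this kind can be sketched in a few lines of pandas and NumPy. The function label_split_indices is hypothetical and only mirrors the arguments listed above; returning positional indices, and leaving rows with any other label out of every split, are assumptions of this sketch rather than documented behaviour.

```python
from typing import Tuple

import numpy as np
import pandas as pd


def label_split_indices(
    data: pd.DataFrame,
    column_name: str = "BITFOUNT_SPLIT_CATEGORY",
    training_set_label: str = "TRAIN",
    validation_set_label: str = "VALIDATE",
    test_set_label: str = "TEST",
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Return positional indices per split, based on a label column (sketch)."""
    labels = data[column_name]
    # np.flatnonzero gives the positions where the boolean mask is True.
    train = np.flatnonzero((labels == training_set_label).to_numpy())
    validation = np.flatnonzero((labels == validation_set_label).to_numpy())
    test = np.flatnonzero((labels == test_set_label).to_numpy())
    return train, validation, test


frame = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5],
        "BITFOUNT_SPLIT_CATEGORY": ["TRAIN", "TEST", "TRAIN", "VALIDATE", "TRAIN"],
    }
)
train, validation, test = label_split_indices(frame)
print(train.tolist(), validation.tolist(), test.tolist())
# prints: [0, 2, 4] [3] [1]
```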

Ancestors

DatasetSplitter
abc.ABC

Static methods

def splitter_name() ‑> str:

Class method for splitter name.

Returns: The string name for the splitter type.

Methods

def create_dataset_splits(self, data: pd.DataFrame) ‑> Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]:

Create splits in the dataset for the training, validation, and test sets.

Arguments

data: The dataframe to be split.

Returns: A tuple of arrays, each containing the indices from the data to be used for training, validation, and testing, respectively.

Variables

static column_name : str

static test_set_label : str

static training_set_label : str

static validation_set_label : str

Inherited members

DatasetSplitter:
- DatasetSplitter.create
- DatasetSplitter.get_split_query
