Classes concerning data structures.

DataStructures provide information about the columns of a BaseSource for a specific Modelling Job.



class BaseDataStructure():

Base DataStructure class.


class DataStructure(    table: Optional[Union[str, Mapping[str, str]]] = None,    query: Optional[Union[str, Mapping[str, str]]] = None,    schema_types_override: Optional[Union[SchemaOverrideMapping, Mapping[str, SchemaOverrideMapping]]] = None,    target: Optional[Union[str, List[str]]] = None,    ignore_cols: List[str] = <factory>,    selected_cols: List[str] = <factory>,    data_splitter: Optional[DatasetSplitter] = None,    loss_weights_col: Optional[str] = None,    multihead_col: Optional[str] = None,    multihead_size: Optional[int] = None,    ignore_classes_col: Optional[str] = None,    image_cols: Optional[List[str]] = None,    batch_transforms: Optional[List[Dict[str, _JSONDict]]] = None,    dataset_transforms: Optional[List[Dict[str, _JSONDict]]] = None,):

Information about the columns of a BaseSource.

This component provides the desired structure of data to be used by discriminative machine learning models.


  • table: The table in the Pod schema to be used for local data. If executing a remote task, this should be a mapping of Pod names to table names. Defaults to None.
  • query: The SQL query to apply to the data. This should be a string for local data, or a mapping of Pod names to queries for a remote task. Defaults to None.
  • schema_types_override: A mapping that defines the new data types that will be returned after the SQL query is executed. For a local training task it will be a mapping of column names to their types, for a remote task it will be a mapping of the Pod name to the new columns and types. If a column is defined as categorical, the mapping should include a mapping to the categories. Required if a SQL query is provided. E.g. {'Pod_id': {'categorical': [{'col1': {'value_1': 0, 'value_2': 1}}], 'continuous': ['col2']}} for remote training or {'categorical': [{'col1': {'value_1': 0, 'value_2': 1}}], 'continuous': ['col2']} for local training. Defaults to None.
  • target: The training target column or list of columns.
  • ignore_cols: A list of columns to ignore when getting the data. Defaults to an empty list.
  • selected_cols: A list of columns to select when getting the data. Defaults to an empty list.
  • data_splitter: The approach used for splitting the data into training, validation and test sets. Defaults to None.
  • loss_weights_col: A column name which provides a weight to be given to each sample in the loss function. Defaults to None.
  • multihead_col: A categorical column whose number of unique values determines the number of heads in a Neural Network. Used for multitask training. Defaults to None.
  • multihead_size: The number of unique values in the multihead_col. Used for multitask training. Required if multihead_col is provided. Defaults to None.
  • ignore_classes_col: A column name denoting which classes to ignore in a multilabel multiclass classification problem. Each value is expected to contain a list of numbers corresponding to the indices of the classes to be ignored as per the order provided in target. E.g. [0,2,3]. An empty list can be provided (e.g. []) to avoid ignoring any classes for some samples. Defaults to None.
  • image_cols: A list of columns that will be treated as images in the data.
  • batch_transforms: A list of dictionaries describing the transformations to apply to batches. Defaults to None.
  • dataset_transforms: A list of dictionaries describing the transformations to apply to the whole dataset. Defaults to None.
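
The two shapes of schema_types_override can be written out as plain Python dictionaries. This is a minimal sketch that follows the example in the argument description: the Pod name and column names are taken from that example, while the helper wrap_for_pods is hypothetical and only illustrates how the remote form nests the local form under each Pod name.

```python
from typing import Any, Dict, List

# Local form: semantic type -> columns. Categorical columns carry their
# category-to-index mapping; continuous columns are listed by name.
local_override: Dict[str, List[Any]] = {
    "categorical": [{"col1": {"value_1": 0, "value_2": 1}}],
    "continuous": ["col2"],
}


def wrap_for_pods(
    override: Dict[str, List[Any]], pod_names: List[str]
) -> Dict[str, Dict[str, List[Any]]]:
    """Hypothetical helper: nest the same override under each Pod name."""
    return {pod: override for pod in pod_names}


# Remote form: Pod name -> local-style override.
remote_override = wrap_for_pods(local_override, ["Pod_id"])
```

In practice each Pod may need its own override; reusing one mapping for every Pod is a simplification made for this sketch.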


  • DataStructureError: If query is provided as well as either selected_cols or ignore_cols.
  • DataStructureError: If both ignore_cols and selected_cols are provided.
  • DataStructureError: If the multihead_col is provided without multihead_size.

If the datastructure includes image columns, batch transformations will be applied to them.
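
The three documented error conditions can be sketched as checks run at construction time. The class below is a hypothetical stand-in, not the library's implementation: it declares only the fields involved in the checks, and it defines its own DataStructureError so that the snippet runs on its own.

```python
from dataclasses import dataclass, field
from typing import List, Mapping, Optional, Union


class DataStructureError(Exception):
    """Stand-in error type, defined here so the sketch is self-contained."""


@dataclass
class SketchDataStructure:
    """Hypothetical stand-in that mirrors only the documented checks."""

    table: Optional[Union[str, Mapping[str, str]]] = None
    query: Optional[Union[str, Mapping[str, str]]] = None
    ignore_cols: List[str] = field(default_factory=list)
    selected_cols: List[str] = field(default_factory=list)
    multihead_col: Optional[str] = None
    multihead_size: Optional[int] = None

    def __post_init__(self) -> None:
        # A query defines its own columns, so column selection is rejected.
        if self.query is not None and (self.selected_cols or self.ignore_cols):
            raise DataStructureError(
                "query cannot be combined with selected_cols or ignore_cols"
            )
        # Selecting and ignoring columns at the same time is ambiguous.
        if self.selected_cols and self.ignore_cols:
            raise DataStructureError(
                "provide either selected_cols or ignore_cols, not both"
            )
        # The number of heads must be stated alongside the multihead column.
        if self.multihead_col is not None and self.multihead_size is None:
            raise DataStructureError("multihead_col requires multihead_size")
```

For example, SketchDataStructure(table="t", selected_cols=["a"], ignore_cols=["b"]) raises DataStructureError, while passing only one of the two lists is accepted.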


Static methods

def create_datastructure(    table_config: DataStructureTableConfig,    select: DataStructureSelectConfig,    transform: DataStructureTransformConfig,    assign: DataStructureAssignConfig,) -> DataStructure:

Creates a datastructure based on the yaml config.


  • table_config: The table in the Pod schema to be used for local data. If executing a remote task, this should be a mapping of Pod names to table names.
  • select: The configuration for columns to be included/excluded from the DataStructure.
  • transform: The configuration for dataset and batch transformations to be applied to the data.
  • assign: The configuration for special columns in the DataStructure.

Returns A DataStructure object.

def load_from_file(    file_path: Union[str, PathLike],) -> DataStructure:

Loads DataStructure from yaml file.


  • file_path: The path to a yaml file with the DataStructure configuration.

Returns The loaded DataStructure.


def apply_dataset_transformations(self, datasource: BaseSource) -> BaseSource:

Applies transformations to whole dataset.


  • datasource: The BaseSource object to be transformed.

Returns datasource: The transformed datasource.

def get_batch_transformations(    self,) -> Optional[List[BatchTimeOperation]]:

Returns batch transformations to be performed as callables.

Returns A list of batch transformations to be passed to TransformationProcessor.

def get_columns_ignored_for_training(self, table_schema: TableSchema) -> List[str]:

Adds all the extra columns that will not be used in model training.


  • table_schema: The schema of the table.

Returns ignore_cols_aux: A list of columns that will be ignored when training a model.

def get_pod_identifiers(self) -> Optional[List[str]]:

Returns a list of pod identifiers specified in the table attribute.

If there are no pod identifiers specified, returns None.
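
A minimal sketch of this behaviour, assuming the pod identifiers are the keys of the table mapping used for remote tasks. The function name pod_identifiers_from_table is hypothetical and stands apart from the class so that it runs without the library.

```python
from collections.abc import Mapping
from typing import List, Optional, Union
from typing import Mapping as MappingType


def pod_identifiers_from_table(
    table: Optional[Union[str, MappingType[str, str]]],
) -> Optional[List[str]]:
    """Hypothetical sketch: a non-empty mapping's keys are the pod identifiers."""
    if isinstance(table, Mapping) and table:
        return list(table.keys())
    # A plain table name (local data) or no table names no pods.
    return None
```

With this sketch, a mapping such as {"pod_a": "patients", "pod_b": "visits"} yields the two pod names, while a plain string table name yields None.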

def get_table_name(self, pod_identifier: Optional[str] = None) -> str:

Returns the relevant table name of the DataStructure.

Returns The table name of the DataStructure corresponding to the pod_identifier provided or just the local table name if running locally.


  • ValueError: If the pod_identifier is not provided and there are different table names for different pods.
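
The lookup described above can be sketched as a standalone function. This is an illustration under stated assumptions, not the library's code: the name table_name_for is hypothetical, the table is assumed to be set (the query case is not covered), and returning the shared name when every pod uses the same table is inferred from the wording of the ValueError condition.

```python
from typing import Mapping, Optional, Union


def table_name_for(
    table: Union[str, Mapping[str, str]],
    pod_identifier: Optional[str] = None,
) -> str:
    """Hypothetical sketch of resolving a table name for local or remote use."""
    # Local data: the table is already a single name.
    if isinstance(table, str):
        return table
    # Remote task with an explicit pod: look up that pod's table.
    if pod_identifier is not None:
        return table[pod_identifier]
    # No pod given: only unambiguous if every pod shares one table name.
    names = set(table.values())
    if len(names) != 1:
        raise ValueError(
            "pod_identifier is required when pods use different table names"
        )
    return names.pop()
```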

def get_table_schema(    self,    schema: BitfountSchema,    pod_identifier: Optional[str] = None,    datasource: Optional[BaseSource] = None,) -> TableSchema:

Returns the table schema based on the datastructure arguments.

This will return either the new schema defined by the schema_types_override if the datastructure has been initialised with a query, or the relevant table schema if the datastructure has been initialised with a table name.


  • schema: The BitfountSchema either taken from the pod or provided by the user when defining a model.
  • pod_identifier: The pod identifier(s) on which the model will be trained. Defaults to None.
  • datasource: The datasource on which the model will be trained. Defaults to None.

def set_columns_after_transformations(    self, transforms: List[Dict[str, _JSONDict]],) -> None:

Updates the selected/ignored columns based on the transformations applied.

It updates self.selected_cols by adding on the new names of columns after transformations are applied, and removing the original columns unless explicitly specified to keep.


  • transforms: A list of transformations to be applied to the data.

def set_training_column_split_by_semantic_type(self, schema: TableSchema) -> None:

Sets the column split by type from the schema.

This method splits the selected columns from the dataset based on their semantic type.


  • schema: The TableSchema for the data.
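
The idea of splitting selected columns by semantic type can be sketched with plain dictionaries. Everything here is an assumption made for illustration: the schema is reduced to a hypothetical mapping of column name to semantic type, and the function returns the split instead of storing it on the object.

```python
from collections import defaultdict
from typing import Dict, List


def split_by_semantic_type(
    selected_cols: List[str], column_types: Dict[str, str]
) -> Dict[str, List[str]]:
    """Hypothetical sketch: group the selected columns by their semantic type."""
    split: Dict[str, List[str]] = defaultdict(list)
    for col in selected_cols:
        # Columns absent from selected_cols never appear in the split.
        split[column_types[col]].append(col)
    return dict(split)


# Hypothetical column types, reusing the type names from the override example.
column_types = {"col1": "categorical", "col2": "continuous", "col3": "continuous"}
split = split_by_semantic_type(["col1", "col2", "col3"], column_types)
```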

def set_training_input_size(self, schema: TableSchema) -> None:

Sets the input size for model training.


  • schema: The schema of the table.
  • table_name: The name of the table.


