datastructure

Classes concerning data structures.

DataStructures provide information about the columns of a BaseSource for a specific Modelling Job.

Classes

BaseDataStructure

class BaseDataStructure():

Base DataStructure class.

Subclasses

DataStructure

DataStructure

class DataStructure(    table: Optional[Union[str, Mapping[str, str]]] = None,    query: Optional[Union[str, Mapping[str, str]]] = None,    schema_types_override: Optional[Union[SchemaOverrideMapping, Mapping[str, SchemaOverrideMapping]]] = None,    target: Optional[Union[str, List[str]]] = None,    ignore_cols: List[str] = <factory>,    selected_cols: List[str] = <factory>,    data_splitter: Optional[DatasetSplitter] = None,    loss_weights_col: Optional[str] = None,    multihead_col: Optional[str] = None,    multihead_size: Optional[int] = None,    ignore_classes_col: Optional[str] = None,    image_cols: Optional[List[str]] = None,    batch_transforms: Optional[List[Dict[str, _JSONDict]]] = None,    dataset_transforms: Optional[List[Dict[str, _JSONDict]]] = None,):

Information about the columns of a BaseSource.

This component provides the desired structure of data to be used by discriminative machine learning models.

Arguments

table: The table in the Pod schema to be used for local data. If executing a remote task, this should be a mapping of Pod names to table names. Defaults to None.
query: The SQL query to apply to the data. It should be a string if it is used for local data, or a mapping of Pod names to queries if executing a remote task. Defaults to None.
schema_types_override: A mapping that defines the new data types that will be returned after the SQL query is executed. For a local training task, it is a mapping of semantic types ('categorical', 'continuous', 'image', 'text') to the columns of that type. For a remote task, it is a mapping of each Pod name to such a mapping. If a column is defined as categorical, its entry should include a mapping of its categories to their encoded values. Required if a SQL query is provided. E.g. {'Pod_id': {'categorical': [{'col1': {'value_1': 0, 'value_2': 1}}], 'continuous': ['col2']}} for remote training or {'categorical': [{'col1': {'value_1': 0, 'value_2': 1}}], 'continuous': ['col2']} for local training. Defaults to None.
target: The training target column or list of columns.
ignore_cols: A list of columns to ignore when getting the data. Defaults to an empty list.
selected_cols: A list of columns to select when getting the data. Defaults to an empty list.
data_splitter: Approach used for splitting the data into training, test, validation. Defaults to None.
loss_weights_col: A column name which provides a weight to be given to each sample in loss function. Defaults to None.
multihead_col: A categorical column whereby the number of unique values will determine number of heads in a Neural Network. Used for multitask training. Defaults to None.
multihead_size: The number of unique values in the multihead_col. Used for multitask training. Required if multihead_col is provided. Defaults to None.
ignore_classes_col: A column name denoting which classes to ignore in a multilabel multiclass classification problem. Each value is expected to contain a list of numbers corresponding to the indices of the classes to be ignored as per the order provided in target. E.g. [0,2,3]. An empty list can be provided (e.g. []) to avoid ignoring any classes for some samples. Defaults to None.
image_cols: A list of columns that will be treated as images in the data.
batch_transforms: A list of dictionaries describing the transformations to apply to batches. Defaults to None.
dataset_transforms: A list of dictionaries describing the transformations to apply to the whole dataset. Defaults to None.
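To make the two shapes of schema_types_override concrete, the sketch below rebuilds the local and remote examples from the argument description as plain dictionaries. The Pod name `Pod_id`, the column names, and the helper `is_remote_override` are illustrative assumptions and are not part of the library.

```python
from typing import Any, Mapping

# Semantic types as listed in the schema_types_override type annotation.
SEMANTIC_TYPES = {"categorical", "continuous", "image", "text"}

# Local task: semantic type -> list of columns.
# A categorical column maps its name to a {category: encoded value} mapping.
local_override = {
    "categorical": [{"col1": {"value_1": 0, "value_2": 1}}],
    "continuous": ["col2"],
}

# Remote task: Pod name -> the same per-type mapping.
remote_override = {"Pod_id": local_override}


def is_remote_override(override: Mapping[str, Any]) -> bool:
    """Hypothetical helper: True if any top-level key is not a semantic
    type, i.e. the keys look like Pod names rather than column types."""
    return not set(override).issubset(SEMANTIC_TYPES)
```

Checking the top-level keys is only one way to tell the two shapes apart; it relies on Pod names never colliding with the four semantic type names.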

Raises

DataStructureError: If query is provided as well as either selected_cols or ignore_cols.
DataStructureError: If both ignore_cols and selected_cols are provided.
DataStructureError: If the multihead_col is provided without multihead_size.
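The three conditions above can be sketched as a standalone check. This is a minimal illustration of the documented rules, not the library's implementation: the `DataStructureError` class here is a local stand-in, and `check_arguments` is a hypothetical function name.

```python
from typing import List, Optional


class DataStructureError(Exception):
    """Local stand-in for the library's DataStructureError."""


def check_arguments(
    query: Optional[str] = None,
    selected_cols: Optional[List[str]] = None,
    ignore_cols: Optional[List[str]] = None,
    multihead_col: Optional[str] = None,
    multihead_size: Optional[int] = None,
) -> None:
    """Hypothetical check mirroring the documented error conditions."""
    selected_cols = selected_cols or []
    ignore_cols = ignore_cols or []
    # A query defines its own columns, so column filters are rejected.
    if query is not None and (selected_cols or ignore_cols):
        raise DataStructureError(
            "query cannot be combined with selected_cols or ignore_cols"
        )
    # The two column filters are mutually exclusive.
    if selected_cols and ignore_cols:
        raise DataStructureError(
            "provide either selected_cols or ignore_cols, not both"
        )
    # The number of heads must be stated alongside the multihead column.
    if multihead_col is not None and multihead_size is None:
        raise DataStructureError("multihead_col requires multihead_size")
```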

If the datastructure includes image columns, batch transformations will be applied to them.

Ancestors

BaseDataStructure
bitfount.types._BaseSerializableObjectMixIn

Static methods

def create_datastructure(    table_config: DataStructureTableConfig,    select: DataStructureSelectConfig,    transform: DataStructureTransformConfig,    assign: DataStructureAssignConfig,) ‑> DataStructure:

Creates a datastructure based on the yaml config.

Arguments

table_config: The table in the Pod schema to be used for local data. If executing a remote task, this should be a mapping of Pod names to table names.
select: The configuration for columns to be included/excluded from the DataStructure.
transform: The configuration for dataset and batch transformations to be applied to the data.
assign: The configuration for special columns in the DataStructure.

Returns A DataStructure object.

def load_from_file(    file_path: Union[str, PathLike],) ‑> DataStructure:

Loads DataStructure from yaml file.

Arguments

file_path: The path to a YAML file with the DataStructure configuration.

Returns The loaded DataStructure.

Methods

def apply_dataset_transformations(self, datasource: BaseSource) ‑> BaseSource:

Applies transformations to whole dataset.

Arguments

datasource: The BaseSource object to be transformed.

Returns datasource: The transformed datasource.

def get_batch_transformations(    self,) ‑> Optional[List[BatchTimeOperation]]:

Returns batch transformations to be performed as callables.

Returns A list of batch transformations to be passed to TransformationProcessor.

def get_columns_ignored_for_training(self, table_schema: TableSchema) ‑> List[str]:

Adds all the extra columns that will not be used in model training.

Arguments

table_schema: The schema of the table.

Returns ignore_cols_aux: A list of columns that will be ignored when training a model.

def get_pod_identifiers(self) ‑> Optional[List[str]]:

Returns a list of pod identifiers specified in the table attribute.

If there are no pod identifiers specified, returns None.
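As a rough illustration of this behaviour, the sketch below derives pod identifiers from a `table` value that is either a single table name or a mapping of Pod names to table names. The function name `pod_identifiers_from_table` and the example Pod names are assumptions made for this sketch; the library's method may consult other attributes as well.

```python
from collections.abc import Mapping as AbcMapping
from typing import List, Mapping, Optional, Union


def pod_identifiers_from_table(
    table: Optional[Union[str, Mapping[str, str]]],
) -> Optional[List[str]]:
    """Hypothetical sketch: a mapping's keys are the pod identifiers;
    a plain table name (or no table) carries none."""
    if isinstance(table, AbcMapping):
        return list(table)
    return None
```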

def get_table_name(self, pod_identifier: Optional[str] = None) ‑> str:

Returns the relevant table name of the DataStructure.

Returns The table name of the DataStructure corresponding to the pod_identifier provided or just the local table name if running locally.

Raises

ValueError: If the pod_identifier is not provided and there are different table names for different pods.
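The lookup described above can be sketched as follows, assuming `table` is either a plain string (local use) or a mapping of Pod names to table names. The function `resolve_table_name` is hypothetical, and returning the shared name when every Pod uses the same table is an inference from the documented ValueError condition rather than confirmed behaviour.

```python
from typing import Mapping, Optional, Union


def resolve_table_name(
    table: Union[str, Mapping[str, str]],
    pod_identifier: Optional[str] = None,
) -> str:
    """Hypothetical sketch of resolving a table name for a pod."""
    # Local case: a single table name applies everywhere.
    if isinstance(table, str):
        return table
    # Remote case with an explicit pod: look it up directly.
    if pod_identifier is not None:
        return table[pod_identifier]
    # No pod given: only unambiguous if all pods share one table name.
    names = set(table.values())
    if len(names) == 1:
        return names.pop()
    raise ValueError(
        "pod_identifier is required when pods use different table names"
    )
```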

def get_table_schema(    self,    schema: BitfountSchema,    pod_identifier: Optional[str] = None,    datasource: Optional[BaseSource] = None,) ‑> TableSchema:

Returns the table schema based on the datastructure arguments.

This will return either the new schema defined by the schema_types_override if the datastructure has been initialised with a query, or the relevant table schema if the datastructure has been initialised with a table name.

Arguments

schema: The BitfountSchema either taken from the pod or provided by the user when defining a model.
pod_identifier: The pod identifier(s) on which the model will be trained. Defaults to None.
datasource: The datasource on which the model will be trained. Defaults to None.

def set_columns_after_transformations(    self, transforms: List[Dict[str, _JSONDict]],) ‑> None:

Updates the selected/ignored columns based on the transformations applied.

It updates self.selected_cols by adding on the new names of columns after transformations are applied, and removing the original columns unless explicitly specified to keep.

Arguments

transforms: A list of transformations to be applied to the data.
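The column bookkeeping described above might look like the sketch below. The transformation format used here (the keys `col`, `new_col`, and `keep_original`) is a simplified, hypothetical stand-in chosen for illustration; it is not the library's transformation schema, and `update_selected_cols` is not a library function.

```python
from typing import Any, Dict, List


def update_selected_cols(
    selected_cols: List[str],
    transforms: List[Dict[str, Any]],
) -> List[str]:
    """Hypothetical sketch: add each transformed column's new name and
    drop the original unless the transform asks to keep it."""
    updated = list(selected_cols)  # do not mutate the caller's list
    for tfm in transforms:
        original, new_name = tfm["col"], tfm["new_col"]
        if new_name not in updated:
            updated.append(new_name)
        if not tfm.get("keep_original", False) and original in updated:
            updated.remove(original)
    return updated
```

The sketch returns a new list rather than updating an attribute in place, which keeps it easy to test in isolation.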

def set_training_column_split_by_semantic_type(self, schema: TableSchema) ‑> None:

Sets the column split by type from the schema.

This method splits the selected columns from the dataset based on their semantic type.

Arguments

schema: The TableSchema for the data.
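One way to picture this split is the sketch below, which groups selected columns by semantic type. It represents the schema as a plain mapping of semantic type to column names, which is an assumption made for illustration; the real TableSchema object and the attribute the method writes to are not shown in this reference.

```python
from typing import Dict, List, Mapping


def split_by_semantic_type(
    selected_cols: List[str],
    schema_types: Mapping[str, List[str]],
) -> Dict[str, List[str]]:
    """Hypothetical sketch: keep only the selected columns under each
    semantic type, preserving the schema's column order."""
    selected = set(selected_cols)
    return {
        semantic_type: [col for col in cols if col in selected]
        for semantic_type, cols in schema_types.items()
    }


# Example schema and selection (column names are made up).
schema_types = {
    "categorical": ["sex", "city"],
    "continuous": ["age"],
    "image": ["scan"],
}
split = split_by_semantic_type(["age", "sex", "scan"], schema_types)
```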

def set_training_input_size(self, schema: TableSchema) ‑> None:

Sets the input size for model training.

Arguments

schema: The schema of the table.
table_name: The name of the table.

Variables

static batch_transforms : Optional[List[Dict[str, Dict[str, Any]]]]

static data_splitter : Optional[DatasetSplitter]

static dataset_transforms : Optional[List[Dict[str, Dict[str, Any]]]]

static fields_dict : ClassVar[Dict[str, marshmallow.fields.Field]]

static ignore_classes_col : Optional[str]

static ignore_cols : List[str]

static image_cols : Optional[List[str]]

static loss_weights_col : Optional[str]

static multihead_col : Optional[str]

static multihead_size : Optional[int]

static nested_fields : ClassVar[Dict[str, Mapping[str, Any]]]

static query : Union[str, Mapping[str, str], None]

static schema_types_override : Union[Mapping[Literal['categorical', 'continuous', 'image', 'text'], List[Union[str, Mapping[str, Mapping[str, int]]]]], Mapping[str, Mapping[Literal['categorical', 'continuous', 'image', 'text'], List[Union[str, Mapping[str, Mapping[str, int]]]]]], None]

static selected_cols : List[str]

static table : Union[str, Mapping[str, str], None]

static target : Union[List[str], str, None]
