datastructure
Classes concerning data structures.
DataStructures provide information about the columns of a BaseSource for a specific Modelling Job.
Classes
BaseDataStructure
class BaseDataStructure():
Base DataStructure class.
Subclasses
DataStructure
class DataStructure( table: Optional[Union[str, Mapping[str, str]]] = None, query: Optional[Union[str, Mapping[str, str]]] = None, schema_types_override: Optional[Union[SchemaOverrideMapping, Mapping[str, SchemaOverrideMapping]]] = None, target: Optional[Union[str, List[str]]] = None, ignore_cols: List[str] = <factory>, selected_cols: List[str] = <factory>, data_splitter: Optional[DatasetSplitter] = None, loss_weights_col: Optional[str] = None, multihead_col: Optional[str] = None, multihead_size: Optional[int] = None, ignore_classes_col: Optional[str] = None, image_cols: Optional[List[str]] = None, batch_transforms: Optional[List[Dict[str, _JSONDict]]] = None, dataset_transforms: Optional[List[Dict[str, _JSONDict]]] = None,):
Information about the columns of a BaseSource.
This component provides the desired structure of data to be used by discriminative machine learning models.
Arguments
table
: The table in the Pod schema to be used for local data. If executing a remote task, this should be a mapping of Pod names to table names. Defaults to None.
query
: The SQL query to apply to the data. It should be a string if it is used for local data, or a mapping of Pod names to queries for a remote task. Defaults to None.
schema_types_override
: A mapping that defines the new data types that will be returned after the SQL query is executed. For a local training task it will be a mapping of column names to their types; for a remote task it will be a mapping of the Pod name to the new columns and types. If a column is defined as categorical, the mapping should include a mapping to the categories. Required if a SQL query is provided. E.g. {'Pod_id': {'categorical': [{'col1': {'value_1': 0, 'value_2': 1}}], 'continuous': ['col2']}} for remote training or {'categorical': [{'col1': {'value_1': 0, 'value_2': 1}}], 'continuous': ['col2']} for local training. Defaults to None.
target
: The training target column or list of columns.
ignore_cols
: A list of columns to ignore when getting the data. Defaults to None.
selected_cols
: A list of columns to select when getting the data. Defaults to None.
data_splitter
: The approach used for splitting the data into training, test and validation sets. Defaults to None.
loss_weights_col
: A column name which provides the weight to be given to each sample in the loss function. Defaults to None.
multihead_col
: A categorical column whose number of unique values determines the number of heads in a neural network. Used for multitask training. Defaults to None.
multihead_size
: The number of unique values in the multihead_col. Used for multitask training. Required if multihead_col is provided. Defaults to None.
ignore_classes_col
: A column name denoting which classes to ignore in a multilabel multiclass classification problem. Each value is expected to contain a list of numbers corresponding to the indices of the classes to be ignored, as per the order provided in target. E.g. [0,2,3]. An empty list can be provided (e.g. []) to avoid ignoring any classes for some samples. Defaults to None.
image_cols
: A list of columns that will be treated as images in the data.
batch_transforms
: A list of dictionaries describing the transformations to apply to batches. Defaults to None.
dataset_transforms
: A list of dictionaries describing the transformations to apply to the whole dataset. Defaults to None.
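To make the two shapes of schema_types_override concrete, the sketch below builds the local and remote forms from the example above as plain Python dictionaries. It does not use the library itself: the type aliases and the helper categorical_encoding are hypothetical names introduced only for illustration, and 'Pod_id', 'col1' and 'col2' are the placeholders from the example.

```python
from typing import Dict, List, Mapping, Union

# Illustrative aliases (not library types): a column is either a plain name
# or, for categorical columns, a mapping of the name to its category encoding.
ColumnSpec = Union[str, Mapping[str, Mapping[str, int]]]
SchemaOverride = Dict[str, List[ColumnSpec]]

# Local training: the override is given directly.
local_override: SchemaOverride = {
    "categorical": [{"col1": {"value_1": 0, "value_2": 1}}],
    "continuous": ["col2"],
}

# Remote training: the same structure, keyed by Pod name.
# "Pod_id" is a placeholder for a real Pod name.
remote_override: Dict[str, SchemaOverride] = {"Pod_id": local_override}


def categorical_encoding(override: SchemaOverride, column: str) -> Dict[str, int]:
    """Hypothetical helper: return the category encoding of a column."""
    for spec in override.get("categorical", []):
        if isinstance(spec, dict) and column in spec:
            return dict(spec[column])
    raise KeyError(column)
```

The remote form only adds one level of nesting, so the same lookup works on remote_override["Pod_id"].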
Raises
DataStructureError
: If query is provided as well as either selected_cols or ignore_cols.
DataStructureError
: If both ignore_cols and selected_cols are provided.
DataStructureError
: If the multihead_col is provided without multihead_size.
If the datastructure includes image columns, batch transformations will be applied to them.
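The three documented error conditions can be sketched as a standalone check. This is not the library's implementation: DataStructureError is redefined locally as a stand-in, and validate_datastructure_args is a hypothetical function that only mirrors the rules listed under Raises.

```python
from typing import List, Mapping, Optional, Union


class DataStructureError(Exception):
    """Local stand-in for the library's error type (illustration only)."""


def validate_datastructure_args(
    query: Optional[Union[str, Mapping[str, str]]] = None,
    ignore_cols: Optional[List[str]] = None,
    selected_cols: Optional[List[str]] = None,
    multihead_col: Optional[str] = None,
    multihead_size: Optional[int] = None,
) -> None:
    """Hypothetical check mirroring the documented error conditions."""
    ignore_cols = ignore_cols or []
    selected_cols = selected_cols or []
    # Rule 1: a query cannot be combined with column selection.
    if query is not None and (ignore_cols or selected_cols):
        raise DataStructureError(
            "query cannot be combined with selected_cols or ignore_cols"
        )
    # Rule 2: selecting and ignoring columns are mutually exclusive.
    if ignore_cols and selected_cols:
        raise DataStructureError(
            "ignore_cols and selected_cols cannot both be provided"
        )
    # Rule 3: a multihead column needs an explicit number of heads.
    if multihead_col is not None and multihead_size is None:
        raise DataStructureError("multihead_col requires multihead_size")
```

For example, validate_datastructure_args(selected_cols=["age"]) passes silently, while adding query="SELECT age FROM patients" to the same call raises DataStructureError.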
Ancestors
- BaseDataStructure
- bitfount.types._BaseSerializableObjectMixIn
Static methods
def create_datastructure( table_config: DataStructureTableConfig, select: DataStructureSelectConfig, transform: DataStructureTransformConfig, assign: DataStructureAssignConfig,) ‑> DataStructure:
Creates a datastructure based on the yaml config.
Arguments
table_config
: The table in the Pod schema to be used for local data. If executing a remote task, this should be a mapping of Pod names to table names.
select
: The configuration for columns to be included/excluded from the DataStructure.
transform
: The configuration for dataset and batch transformations to be applied to the data.
assign
: The configuration for special columns in the DataStructure.
Returns
A DataStructure object.
def load_from_file( file_path: Union[str, PathLike],) ‑> DataStructure:
Loads DataStructure from yaml file.
Arguments
file_path
: The path to a YAML file with the DataStructure configuration.
Returns
The loaded DataStructure.
Methods
def apply_dataset_transformations(self, datasource: BaseSource) ‑> BaseSource:
Applies transformations to whole dataset.
Arguments
datasource
: The BaseSource object to be transformed.
Returns
The transformed datasource.
def get_batch_transformations( self,) ‑> Optional[List[BatchTimeOperation]]:
Returns batch transformations to be performed as callables.
Returns
A list of batch transformations to be passed to TransformationProcessor.
def get_columns_ignored_for_training(self, table_schema: TableSchema) ‑> List[str]:
Adds all the extra columns that will not be used in model training.
Arguments
table_schema
: The schema of the table.
Returns
A list of columns that will be ignored when training a model.
def get_pod_identifiers(self) ‑> Optional[List[str]]:
Returns a list of pod identifiers specified in the table attribute.
If there are no pod identifiers specified, returns None.
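As a rough illustration of this behaviour, the sketch below derives pod identifiers from a table value that is either a plain table name or a mapping of pod identifiers to table names. The function pod_identifiers_from_table and the identifiers in the usage note are hypothetical; the library method may consult other attributes as well.

```python
from typing import List, Mapping, Optional, Union


def pod_identifiers_from_table(
    table: Optional[Union[str, Mapping[str, str]]],
) -> Optional[List[str]]:
    """Hypothetical stand-in: the keys of a mapping are the pod identifiers."""
    if isinstance(table, Mapping):
        return list(table)
    # A plain table name (local data) or no table carries no pod identifiers.
    return None
```

With table={"user/pod-a": "t1", "user/pod-b": "t2"} this returns both keys in insertion order; with a plain string such as "t1" it returns None.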
def get_table_name(self, pod_identifier: Optional[str] = None) ‑> str:
Returns the relevant table name of the DataStructure.
Returns
The table name of the DataStructure corresponding to the pod_identifier provided, or just the local table name if running locally.
Raises
ValueError
: If the pod_identifier is not provided and there are different table names for different pods.
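The lookup rules described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: resolve_table_name is a hypothetical function, and returning the shared name when every pod uses the same table is inferred from the wording of the ValueError condition.

```python
from typing import Mapping, Optional, Union


def resolve_table_name(
    table: Union[str, Mapping[str, str]],
    pod_identifier: Optional[str] = None,
) -> str:
    """Hypothetical sketch of resolving a table name for a pod."""
    # Local case: the table is a single name.
    if isinstance(table, str):
        return table
    # Remote case with an explicit pod: look the name up directly.
    if pod_identifier is not None:
        return table[pod_identifier]
    # Remote case without a pod: only unambiguous if all names agree (assumed).
    names = set(table.values())
    if len(names) == 1:
        return names.pop()
    raise ValueError(
        "pod_identifier is required when pods use different table names"
    )
```

For instance, resolve_table_name({"pod-a": "t1", "pod-b": "t2"}, "pod-b") returns "t2", while omitting the identifier for the same mapping raises ValueError.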
def get_table_schema( self, schema: BitfountSchema, pod_identifier: Optional[str] = None, datasource: Optional[BaseSource] = None,) ‑> TableSchema:
Returns the table schema based on the datastructure arguments.
This will return either the new schema defined by the schema_types_override if the datastructure has been initialised with a query, or the relevant table schema if the datastructure has been initialised with a table name.
Arguments
schema
: The BitfountSchema either taken from the pod or provided by the user when defining a model.
pod_identifier
: The pod identifier(s) on which the model will be trained. Defaults to None.
datasource
: The datasource on which the model will be trained. Defaults to None.
def set_columns_after_transformations( self, transforms: List[Dict[str, _JSONDict]],) ‑> None:
Updates the selected/ignored columns based on the transformations applied.
It updates self.selected_cols by adding the new names of columns after transformations are applied, and removing the original columns unless they are explicitly specified to be kept.
Arguments
transforms
: A list of transformations to be applied to the data.
def set_training_column_split_by_semantic_type(self, schema: TableSchema) ‑> None:
Sets the column split by type from the schema.
This method splits the selected columns from the dataset based on their semantic type.
Arguments
schema
: The TableSchema for the data.
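The idea of splitting selected columns by semantic type can be sketched with plain dictionaries. Everything here is an assumption made for illustration: the schema is modelled as a mapping of semantic type to column names rather than a real TableSchema, split_by_semantic_type is a hypothetical function, and the four type names are taken from the schema_types_override annotation.

```python
from typing import Dict, List, Mapping, Sequence

# Semantic types as they appear in the schema_types_override annotation.
SEMANTIC_TYPES = ("categorical", "continuous", "image", "text")


def split_by_semantic_type(
    selected_cols: Sequence[str],
    columns_by_type: Mapping[str, Sequence[str]],
) -> Dict[str, List[str]]:
    """Hypothetical sketch: group the selected columns by semantic type."""
    selected = set(selected_cols)
    return {
        semantic_type: [
            col for col in columns_by_type.get(semantic_type, []) if col in selected
        ]
        for semantic_type in SEMANTIC_TYPES
    }


# Usage with a made-up schema: only the selected columns are kept.
example_schema = {
    "categorical": ["sex", "country"],
    "continuous": ["age", "income"],
    "image": ["scan"],
}
example_split = split_by_semantic_type(["age", "sex", "scan"], example_schema)
```

Types with no selected columns map to an empty list, so downstream code can iterate over all four keys without checking for missing entries.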
def set_training_input_size(self, schema: TableSchema) ‑> None:
Sets the input size for model training.
Arguments
schema
: The schema of the table.
table_name
: The name of the table.
Variables
- static batch_transforms : Optional[List[Dict[str, Dict[str, Any]]]]
- static data_splitter : Optional[DatasetSplitter]
- static dataset_transforms : Optional[List[Dict[str, Dict[str, Any]]]]
- static fields_dict : ClassVar[Dict[str, marshmallow.fields.Field]]
- static ignore_classes_col : Optional[str]
- static ignore_cols : List[str]
- static image_cols : Optional[List[str]]
- static loss_weights_col : Optional[str]
- static multihead_col : Optional[str]
- static multihead_size : Optional[int]
- static nested_fields : ClassVar[Dict[str, Mapping[str, Any]]]
- static query : Union[str, Mapping[str, str], None]
- static schema_types_override : Union[Mapping[Literal['categorical', 'continuous', 'image', 'text'], List[Union[str, Mapping[str, Mapping[str, int]]]]], Mapping[str, Mapping[Literal['categorical', 'continuous', 'image', 'text'], List[Union[str, Mapping[str, Mapping[str, int]]]]]], None]
- static selected_cols : List[str]
- static table : Union[str, Mapping[str, str], None]
- static target : Union[List[str], str, None]