Preparing to Connect Data
Before you can connect your data to a Pod for analysis, you may want to work with your colleagues or partners to answer important questions about the data and how it will be used. If you have a simple dataset or use case, you can likely configure your Pod based on the examples provided in the Bitfount tutorials or in Data Source Configuration Best Practices without referring too much to this guide.
However, for more complex use cases or for your own reference, the answers to the questions below will dictate which arguments you choose when configuring the Pod using the `Pod` class. You may wish to consider them before moving to the next step!
Pod Nomenclature
Naming Pods clearly is important for searchability within the Bitfount Hub and for avoiding errors when working with Pods. Names are specified using two arguments:
`name`
: This is an argument for the `Pod` class and is the name used for interaction with the Pod via YAML or the Bitfount Python API.

`display_name`
: This is an argument for the `PodDetailsConfig` class and is the name you or your partners will see displayed in the Bitfount Hub when exploring or authorising Pods.
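A minimal sketch of how the two names fit together, assuming the import paths and `PodDetailsConfig` arguments shown here (the file path and descriptions are placeholders; see the API Reference for exact signatures):

```python
from bitfount import CSVSource, Pod, PodDetailsConfig  # import paths may vary by version

# Sketch: a hyphenated, human-readable `name` paired with an
# equivalent `display_name` shown in the Bitfount Hub.
pod = Pod(
    name="census-income-demo",  # used via YAML / the Bitfount Python API
    datasource=CSVSource(path="census_income.csv"),  # placeholder path
    pod_details_config=PodDetailsConfig(
        display_name="Census Income Demo",  # shown in the Bitfount Hub
        description="Demo Pod containing census income data",
    ),
)
```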
1. What are best practices for naming Pods?
If you will be working with colleagues or external partners on authorised Pods, name your Pods so that collaborators can easily find them and understand which data they'd expect to be connected to them. Typically, we suggest:
- Make the `name` and `display_name` as equivalent as possible and human-readable, unless everyone who will interact with the Pods shares a well-understood set of database codes.
- Use no underscores or punctuation in names; these names will be rejected. Where you would use spaces in the `name` argument, separate words with hyphens instead.
2. What happens if I make a mistake on Pod names or create two Pods with the same name?
The `name` argument is the source of truth for creating new or overwriting existing Pods. If you wish to change a Pod's display name, you can easily do so by re-running the Pod configuration with the same `name` argument as before and the new `display_name` you'd like to specify. If you specify different `name` arguments in different Pod configurations with the same display name, however, you will create two different Pods sharing one display name. This may cause confusion among your colleagues or partners, so it is best practice to check that you are not creating a duplicate Pod prior to configuring a new one.
Deleting a Pod is not yet supported. Please reach out on the community Slack channel if you run into issues with Pod naming and configurations.
Data Sources
Data sources are Bitfount's term for the format or database type from which a data custodian connects datasets for analysis and permissioning within the Bitfount Hub. They are specified in the `Pod` class using the `datasource` argument.
1. In which format or database type is my data currently?
Bitfount supports the below file types and databases by default. If your data is not in one of these formats or accessible by database connection, you may wish to convert it to one of these data sources. Note: if your dataset is a set of image files, Bitfount supports connecting these via any data source (see Data Source Configuration Best Practices for an example). We also provide the option to use custom DataSource plugins if desired.
All Bitfount-supported data sources leverage pandas to ensure your file or database contents are compatible with our systems. We do not impose any Bitfount-specific limitations; however, if you run into errors connecting your data, you may need to specify keyword arguments as a dictionary for Bitfount to pass through to pandas.
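For example, if a CSV file uses a non-standard delimiter or encoding, you might pass a dictionary of pandas arguments through the data source. A minimal sketch, assuming a `read_csv_kwargs` parameter on `CSVSource` forwards the dictionary to `pandas.read_csv` (the parameter name is an assumption; check the `CSVSource` API Reference):

```python
from bitfount import CSVSource

# Sketch: the `read_csv_kwargs` parameter name is an assumption; the
# dictionary is intended to be passed through to pandas.read_csv.
datasource = CSVSource(
    path="measurements.csv",  # placeholder path
    read_csv_kwargs={"sep": ";", "encoding": "latin-1"},
)
```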
2. What kind of analyses will I or my partners wish to perform on the dataset(s)?
The analysis you or your partners wish to perform will affect the data source you choose. Most default data sources support Bitfount-supported task elements out of the box; however, if your data is in a multi-sheet Excel file and you or your partners wish to perform tasks across sheets, you must convert your file to a SQLite format as demonstrated in Data Source Configuration Best Practices.
Supported DataSources
Connecting data to a Bitfount Pod is done by specifying the appropriate `DataSource`, which is Bitfount's class for enabling a Pod to read and access data in its correct format. Bitfount currently supports the following DataSources:
| DataSource | Description | Supported Configuration Mechanisms |
| --- | --- | --- |
| CSV | Supports connection of comma-delimited .csv files | YAML, Bitfount Python API |
| Database | Supports connection of PostgreSQL databases | YAML, Bitfount Python API |
| DataFrame | Supports connection of pandas dataframe structures | Bitfount Python API |
| Excel file | Supports connection of standard Excel files | YAML, Bitfount Python API |
| Intermine | Supports connection to Intermine databases | YAML, Bitfount Python API |
| SQLite | Supports connection to SQLite database files | YAML, Bitfount Python API |
For detailed examples on when to use and how to configure each data source type, see Data Source Configuration Best Practices. For more technical details, see the datasources classes page in our API Reference guide. If you don’t see your preferred DataSource here, you may wish to contribute a custom DataSource plugin to Bitfount. Please see Using Custom Data Sources for more details.
Supported Databases
Note: The `bitfount` package does not install database-specific dependencies, so ensure you've installed them prior to attempting to connect your database.
Postgres Installation
Bitfount supports most SQL databases as data sources. To use a PostgreSQL database as a `DataSource` within the Bitfount platform, you must have the following packages installed:
| Package | Version |
| --- | --- |
| bitfount | ≥ 0.5.15 |
| psycopg2-binary | ≥ 2.7.4 |
Intermine Installation
To set up an `IntermineSource` within the Bitfount platform, you must have the following packages installed:
| Package | Version |
| --- | --- |
| bitfount | ≥ 0.5.15 |
| intermine | ≥ 1.13.0 |
Want us to support a specific file format or database type not listed here? Please provide feedback in our community!
Data Configuration
The `PodDataConfig` class enables you to specify arguments dictating how data will be presented to you or your collaborators and 'read' when performing tasks. Within the `Pod` class, these settings are passed via the optional `data_config` argument. To determine whether you need to configure any of these settings, ask yourself:
1. Does my dataset contain any fields/features (including images) for which default data types are not already specified or readable by default?
If you have any fields where values might not conform to common data standards, it is worthwhile to use the `force_stypes` argument to specify the semantic type for each field/feature of the dataset. Bitfount will attempt to assign these by default, so this parameter is optional unless your data includes images: image columns must be mapped using the `"image"` semantic type. Details on this parameter can be found in `PodDataConfig`.
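A minimal sketch, assuming a table key of `"my-table"` and placeholder column names:

```python
from bitfount import PodDataConfig

# Sketch: map semantic types per table. The table key and column
# names below are placeholders.
data_config = PodDataConfig(
    force_stypes={
        "my-table": {
            "categorical": ["diagnosis"],  # treat this column as categorical
            "image": ["scan"],             # required for image columns
        }
    },
)
```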
2. Do I want to exclude any columns from my dataset from being used for analysis?
You can exclude columns of your dataset for the purposes of connecting data to Bitfount using the `ignore_cols` argument, which is defined on your relevant data source class (i.e. `CSVSource`, `DatabaseSource`, etc.) and passed through the `datasource_args` argument of `PodDataConfig` within your `Pod` configuration. This is most commonly used to remove personally identifiable information or internal-use fields that will not be relevant to partners or the analysis they wish to perform; it is also helpful if you don't want to create a new cut of data for every collaboration. Take note of any fields you wish to ignore prior to configuring a Pod.
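For example, to drop PII columns at connection time (the column names here are placeholders):

```python
from bitfount import PodDataConfig

# Sketch: exclude PII / internal-use columns by passing the data
# source's `ignore_cols` argument through `datasource_args`.
data_config = PodDataConfig(
    datasource_args={"ignore_cols": ["ssn", "internal_record_id"]},
)
```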
3. Where and how will my file or database be located/accessed?
Depending on your selected data source, there are additional parameters you need to specify in order to properly connect the data. These are defined at the DataSource level and passed to `PodDataConfig` via the `datasource_args` argument; they typically require you to know file paths, database connection URLs and credentials, or other similar authentication requirements.
See below for a list of additional required parameters by data source type:
| Data Source | Parameters |
| --- | --- |
| CSVSource | `path`: Path or URL to the .csv file |
| DatabaseSource | `db_conn`: Takes arguments specified in `DatabaseConnection` to specify the database URL and authentication variables. <br> `database_partition_size`: Option to dictate how the database is partitioned. <br> `max_row_buffer`: An integer representing the maximum number of rows to stream from the database at a given time; useful for large datasets where a user may run into memory issues when querying or training. |
| DataFrameSource | `data`: The dataframe to be loaded |
| ExcelSource | `path`: The path or URL to the Excel file. <br> `sheet_name`: The name(s) of the sheet(s) to connect. By default, all sheets are loaded. |
| IntermineSource | `service_url`: Your Intermine database URL <br> `token`: Your Intermine authentication token |
Example configurations for each data source type are available in Data Source Configuration Best Practices.
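As an illustration, a `DatabaseSource` might be wired up as below; the connection URI, table name, and exact `DatabaseConnection` arguments are assumptions to verify against the API Reference:

```python
from bitfount import DatabaseConnection, DatabaseSource

# Sketch: connect one table of a PostgreSQL database. The URI and
# table name are placeholders; credentials should not be hard-coded
# in real configurations.
db_conn = DatabaseConnection(
    "postgresql://user:password@db.example.com:5432/research",
    table_names=["patients"],  # argument name is an assumption
)
datasource = DatabaseSource(db_conn=db_conn, max_row_buffer=500)
```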
4. Will my data source consist of multiple files or image files?
If your data source contains references to file paths and you want to change the root folder, you can optionally use the `modifiers` parameter to specify the file path prefix or extension. This is required for image files stored in a directory, so Bitfount can cycle through all of your image data via one data source. Usage of the `modifiers` parameter is demonstrated in tutorial 6.
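A minimal sketch, assuming a column named `file` holds relative image paths; the modifier keys (`prefix`, `suffix`) and the table key are assumptions to check against the `PodDataConfig` reference:

```python
from bitfount import PodDataConfig

# Sketch: point an image-path column at its root folder and append a
# file extension. Column name, keys, and table key are placeholders.
data_config = PodDataConfig(
    modifiers={"file": {"prefix": "/data/images/", "suffix": ".png"}},
    force_stypes={"my-table": {"image": ["file"]}},
)
```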
5. Will I or my partners perform ML or SQL tasks requiring test, training, or validation sets of data? Will the experiments require consistency?
If you are unfamiliar with this process: it is common for ML engineers or data scientists to require subsets of data that serve different functions when developing a model or performing analysis. Within the `PodDataConfig` class, Bitfount provides data custodians with the `data_split` argument for dictating how to split datasets for these purposes. Using this argument requires you to specify a `DataSplitConfig`, which takes the `data_splitter` parameter.
The default for `data_splitter` is to assign 10% of the dataset to a test sample, 10% to a validation sample, and 80% to a sample used for training. If you wish to change the defaults, first specify the integer representing the test sample percentage, then the validation sample percentage; the unspecified portion will be assigned to the training set. For example:
```python
...
data_split=DataSplitConfig(data_splitter="30,10", args={}),
...
```
If your or your partners' analyses require consistent splits, you will also want to specify the `seed` parameter in `datasource_args`. The `seed` parameter sets the starting point for any randomisation task a data scientist wishes to perform. The recommended value for this parameter is `100`. If you specify the seed, your configuration will look like:
```python
...
datasource_args={"seed": 100},
data_split=DataSplitConfig(data_splitter="30,10", args={}),
...
```
6. Does my data require a bit of additional cleaning?
Bitfount supports the removal of NaNs and the normalisation of numeric values. Set `auto_tidy=True` in the `PodDataConfig` when configuring the Pod if you would like us to perform this step on your behalf; otherwise, this optional parameter can be left unset.
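Assuming the defaults are otherwise acceptable, this is a one-line setting:

```python
from bitfount import PodDataConfig

# Enable automatic tidying: NaN removal and normalisation of
# numeric values.
data_config = PodDataConfig(auto_tidy=True)
```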
Data Schemas
When you connect a dataset to a Pod, you will need to define its schema to ensure you or your collaborators are able to correctly perform tasks on the dataset later.
The data schema is displayed on the Pod's profile page in the Bitfount Hub so that Data Scientists can understand which data fields are available to them for analysis and their semantic types (e.g. integers, strings, floats, etc.). If you've correctly specified a DataSource but do not specify the schema for the Pod, Bitfount will attempt to define the schema on your behalf. However, you may wish to consider the following before leaving the `schema` argument as `None`:
1. Are the column or file headers for my dataset what we want to see in the Bitfount Hub or to use in performing tasks? Are they human-readable?
Bitfount currently does not support the alteration or specification of field/feature headers, so we recommend you set file or column headers to human-readable names or references you or your collaborators will understand prior to connecting data to a Pod.
2. Do I want to include multiple tables or Excel sheets in the Pod?
The `BitfountSchema` class allows you to specify multiple `table_name`s or descriptions for tables or columns. This will allow you to associate multiple tables from a given data source with a single Pod if desired. These can then be passed to the `Pod` class via the `schema` argument.
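A hedged sketch; the `BitfountSchema` constructor arguments and the multi-sheet handling shown here are assumptions to verify against the API Reference:

```python
from bitfount import BitfountSchema, ExcelSource, Pod

# Sketch: build a schema over an Excel data source and hand it to the
# Pod. Path, sheet names, and table name are placeholders.
datasource = ExcelSource(path="trial_results.xlsx", sheet_name=["Patients", "Visits"])
schema = BitfountSchema(datasource, table_name="Patients")  # signature is an assumption
pod = Pod(name="trial-results-demo", datasource=datasource, schema=schema)
```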
Multi-Pod Interactions
By default, Pods are not set up to enable the performance of tasks across multiple Pods. To determine whether you need to leverage the `approved_pods` argument of the `Pod` class when configuring a Pod, ask:
Will I or my collaborators need to use this Pod's data in combination with that of another Pod?
If no, do not specify the `approved_pods` argument. If yes, be sure to specify the list of Pods you are comfortable running tasks across in concert with the Pod you are configuring (see the sketch after this list). Keep in mind:
- Any Pods you list will be permissible for querying or running ML tasks in combination with one another only if a Data Scientist has permission to all Pods in the list.
- If the Pods you list do not also list your Pod in their `approved_pods` list(s), Data Scientists who have access to the Pods still will not be able to perform tasks across them.
- If you will use the same dataset for multiple partnerships, and the `approved_pods` list will differ depending on who is accessing the dataset, you will need to create multiple Pods from the same dataset and separately specify their `approved_pods` lists. Be sure to authorise the correct Pods to the correct partners in this case.
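A minimal sketch of the mutual-approval setup described above (all Pod names are placeholders):

```python
from bitfount import CSVSource, Pod

# Sketch: permit tasks combining this Pod with two partner Pods.
# The partner Pods must list "hospital-a-records" in their own
# `approved_pods` lists for cross-Pod tasks to run.
pod = Pod(
    name="hospital-a-records",
    datasource=CSVSource(path="records.csv"),  # placeholder path
    approved_pods=["hospital-b-records", "hospital-c-records"],
)
```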
Privacy-Preserving Configurations
Pod owners have the option to specify additional privacy-preserving controls atop datasets connected to a given Pod. This is typically done based on the `DP Modeller` role you would assign to a given user with access to your Pod. However, if you will always want differential privacy to be applied to your dataset, you can override these user-level controls and assign them at the Pod level. This allows you to enforce the guarantees various privacy-preserving techniques provide if desired. Today, Bitfount supports configurable differential privacy controls. To determine whether you need to set the `pod_dp` argument, ask yourself:
Is my dataset sensitive to the degree it requires additional privacy protections, and/or do I have concerns that my partners will perform malicious attacks against the data?
Most datasets do not require additional privacy-preserving controls by default, in which case specifying `pod_dp` is unnecessary. You may wish to apply these controls if you are dealing with highly regulated or sensitive data, such as patient healthcare records or financial transaction records.
If you're unfamiliar with differential privacy concepts, we cover the basics in tutorial 10. In short, a privacy budget is typically determined by your risk tolerance for the given dataset: the lower your risk tolerance, the lower you should set the budget. However, you must also balance this with the "usefulness" of the data to the Data Scientist; if you set the budget too low, the data will no longer provide the Data Scientist with valuable insights. As a result of this trade-off, we have set what we believe to be a reasonable default budget of 3 for each task. We believe this provides sufficient privacy protection for somewhat sensitive data where the Data Custodian permissions access only to relatively trusted parties.
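A sketch of Pod-level differential privacy using the suggested default budget; the config class name (`DPPodConfig`), its import path, and its parameters are assumptions to verify against the API Reference:

```python
from bitfount import CSVSource, Pod
from bitfount.federated.privacy.differential_privacy import DPPodConfig  # assumed path

# Sketch: cap the per-task privacy budget at the suggested default of 3.
pod = Pod(
    name="patient-records-demo",               # placeholder name
    datasource=CSVSource(path="records.csv"),  # placeholder path
    pod_dp=DPPodConfig(epsilon=3),             # class/parameter names are assumptions
)
```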
Next Steps
Now that you've thought through what you'll need to create and run a Pod, it's time to connect some data! Head to Connecting Data & Running Pods for more details.