Data Source Configuration Best Practices

In Connecting Data & Running Pods, we outlined the mechanisms for configuring Pods. In this guide, we will provide examples by data source for each mechanism and details on best practices for each data source type. Note we leverage various parameter combinations in Pod configuration here which are not representative of all possible configurations. For more details on Pod configuration, please refer to the API Reference.

CSV

A local .csv file is the most common data source for Bitfount data connections.

Best Practices

For .csv files:

  • Pods based on .csv files run only for as long as the hosting process is uninterrupted; a Pod pointing to a local file will go offline if the machine hosting it is turned off or loses connectivity. Bitfount will attempt to bring an interrupted Pod back online, but in most cases it will need to be restarted manually.
  • Best practice is to point to a .csv file hosted on a server not tied to a user's local machine.
  • Exclude any personally identifiable information fields from the Pod configuration specification.
  • Bitfount automatically generates the schema for a .csv file from its header row. Ensure your .csv file includes a header row containing the column names you wish to be reflected in the Pod's schema (see the sketch after this list for a quick way to check).
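
For a quick check, here is a minimal sketch for previewing the header row and column types before configuring the Pod; the file name data.csv is illustrative and pandas is assumed to be installed:

import pandas as pd

# Preview the first few rows to confirm the header row and column names
# that will drive the Pod's schema (file name is illustrative only).
preview = pd.read_csv("data.csv", nrows=5)
print(preview.columns.tolist())  # column names taken from the header row
print(preview.dtypes)            # rough types to inform force_stypes / ignore_cols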

YAML Configuration Example

The configuration YAML file needs to follow the format specified in the PodConfig class:

pod_name: <enter-pod-name-for-system-id>
datasource: CSVSource

pod_details:
  display_name: <Enter Name You'll See in the Hub>
  description: >
    This is a description of the data connected to this Pod with any relevant details for potential collaborators.
data_config:
  ignore_cols: ["Name", "DOB", "National Tax Number"]
  force_stypes:
    enter-your-pod-name:
      categorical: ["TARGET", "workclass", "marital-status", "occupation", "relationship", "race", "native-country", "gender", "education"]
  datasource_args:
    path: <PATH TO CSV>/<filename>.csv
    seed: 100
  data_split: 30,10

Bitfount Python API Configuration Example

Using the Python API is quite similar to specification via YAML. With the Python API, we configure Pods using the PodDetailsConfig and PodDataConfig classes. The former specifies the display name and description of the Pod, whilst the latter is used to customise the schema and underlying data in the data source. For more information, refer to the config_schemas reference guide. Note, Pod names cannot include underscores. Here is a .csv Python API configuration example:

from bitfount import CSVSource, Pod
# Config classes are documented in the config_schemas reference guide.
from bitfount.runners.config_schemas import (
    DataSplitConfig,
    PodDataConfig,
    PodDetailsConfig,
)

pod = Pod(
    name="enter-pod-name-for-system-id",
    datasource=CSVSource("/PATH/OR_URL/TO/YOUR/CSV_FILE.csv"),
    pod_details_config=PodDetailsConfig(
        display_name="Hub Pod Display Name",
        description="This is a description of the data connected to this Pod with any relevant details for potential collaborators.",
    ),
    # Specify the structure of the dataset
    data_config=PodDataConfig(
        # Specify stypes for fields (optional)
        force_stypes={
            "enter-pod-name-for-system-id": {
                "categorical": ["target"],
                "image": ["file"],
            },
        },
        # Optional parameters
        modifiers=None,
        datasource_args={"seed": 100},
        data_split=DataSplitConfig(data_splitter="30,10", args={}),
    ),
)

SQL Database

Connecting a database is our recommended approach to Pod creation if you will engage in a lot of asynchronous collaboration with Data Scientists. More details on the DatabaseSource class can be found here.

Best Practices

  • Ensure any connected databases are tabular and PostgreSQL-compatible.
  • Connect 'always-on' databases that operate independently of local machines to ensure collaborators can always interact with your Pod, even if you yourself are offline.
  • If including multiple tables in a single database connection, ensure you are happy for all of those tables to be accessed by any data scientist to whom you grant access to the Pod. A quick way to check connectivity before configuring the Pod is sketched below.
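
Before configuring the Pod, it can help to confirm that the database is reachable from the hosting machine. A minimal sketch using SQLAlchemy, assuming a PostgreSQL driver such as psycopg2 is installed; the connection string is illustrative:

from sqlalchemy import create_engine, text

# Hypothetical connection string; replace with your own details.
engine = create_engine("postgresql://USERNAME:DBPASSWORD@DBHOST/DBNAME")
with engine.connect() as connection:
    # A trivial query to confirm the database is reachable and credentials work.
    connection.execute(text("SELECT 1"))
print("Database connection OK")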

YAML Configuration Example

When setting up a Pod with a SQL database data source, the Pod owner can choose to give access to a single table in the database, to multiple tables, or to a specific view of the data. The basic YAML file format remains the same as that of the CSVSource.

Single or Multiple Tables: To set up a Pod with access to a single table, the database connection string and the single table name must be defined.

pod:
  name: single-table-database
  display_name: Single table database pod
  description: >
    This pod provides access to data from a single table in a database
  datasource: DatabaseSource
  data_config:
    datasource_args:
      db_conn: postgresql://{USERNAME}:{DBPASSWORD}@{DBHOST}/{DBNAME}
      table_names: [table_name1]

NOTE: If table_names is not specified, all tables the Database user has permission to access will be accessible. To specify multiple tables, add the additional table names to the table_names parameter. For example, table_names: [table_name1, table_name2].

View: To populate a Pod with a restricted view of the data in the database, replace the table_names field with a query field containing the definition of the view, keeping the other descriptive fields the same, e.g.:

...
data_config:
  datasource_args:
    db_conn: postgresql://{USERNAME}:{DBPASSWORD}@{DBHOST}/{DBNAME}
    query: SELECT * FROM table_name1 t1 JOIN table_name2 t2 ON t1.id = t2.id
...

Note, Bitfount supports several formats for specifying the db_conn string. If you'd like to avoid specifying credentials in plaintext, the database connection string can be populated from environment variables.

For this example, you would create a file including the following variables:

USERNAME = <the username with which you want to connect to the database>
DBHOST = <the host of your database>
DBNAME = <the name of the database you want to connect to>
DBPASSWORD = <the password for connecting to the database>

NOTE: Environment variables are expected to be upper case as shown above and may need to be exported if you are setting up the Pod in a subprocess.
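
If you are launching the Pod from a Python process, a minimal sketch for exporting such a file into the environment first, assuming a hypothetical db.env file containing simple KEY=value lines:

import os

# Load a simple KEY=value file (file name is illustrative) into the environment
# so the db_conn string can be populated without plaintext credentials.
with open("db.env") as env_file:
    for line in env_file:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        os.environ[key.strip()] = value.strip()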

NOTE: The bitfount package does not install database-specific dependencies. Please see Preparing to Connect Data for recommended dependencies for a select number of databases.

Bitfount Python API Configuration Example

Connecting a database via the Bitfount Python API requires you to specify the database's connection parameters like so:

import os

from bitfount import DatabaseSource, Pod
from bitfount.data.utils import DatabaseConnection
from bitfount.runners.config_schemas import (
    DataSplitConfig,
    PodDataConfig,
    PodDetailsConfig,
)

# Credentials are read from the environment variables described above.
db_password = os.environ["DBPASSWORD"]
db_conn = DatabaseConnection(
    f"postgresql://{os.environ['USERNAME']}:{db_password}"
    f"@{os.environ['DBHOST']}/{os.environ['DBNAME']}"
)

pod = Pod(
    name="enter-pod-name-for-system-id",
    datasource=DatabaseSource(db_conn=db_conn),
    pod_details_config=PodDetailsConfig(
        display_name="Hub Pod Display Name",
        description="This is a description of the data connected to this Pod with any relevant details for potential collaborators.",
    ),
    # Specify the structure of the dataset
    data_config=PodDataConfig(
        # Specify stypes for fields (optional)
        force_stypes={
            "enter-pod-name-for-system-id": {
                "categorical": ["target"],
                "image": ["file"],
            },
        },
        # Optional parameters
        modifiers=None,
        datasource_args={"seed": 100},
        data_split=DataSplitConfig(data_splitter="30,10", args={}),
    ),
)

DataFrame

Best Practices

  • Ensure you know the structure of the DataFrame prior to Pod configuration (a quick inspection sketch is shown below).
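
For example, a minimal sketch for inspecting a DataFrame before configuring the Pod; the DataFrame here is illustrative:

import pandas as pd

# Illustrative DataFrame; replace with the data you intend to connect.
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

print(df.shape)   # number of rows and columns
print(df.dtypes)  # column types, useful when deciding force_stypes
print(df.head())  # first few rows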

YAML Configuration Example

YAML configuration is not supported for DataFrame data sources.

Bitfount Python API Configuration Example

The main difference between connecting DataFrame sources and connecting other source types is the requirement to instantiate the data source with a pd.DataFrame object. An example of how to do this is as follows:

import pandas as pd

from bitfount import DataFrameSource, Pod
from bitfount.runners.config_schemas import (
    DataSplitConfig,
    PodDataConfig,
    PodDetailsConfig,
)

data_structure = {'col1': [1, 2], 'col2': [3, 4]}
dataframe_object = pd.DataFrame(data=data_structure)

pod = Pod(
    name="enter-pod-name-for-system-id",
    datasource=DataFrameSource(dataframe_object),
    pod_details_config=PodDetailsConfig(
        display_name="Hub Pod Display Name",
        description="This is a description of the data connected to this Pod with any relevant details for potential collaborators.",
    ),
    # Specify the structure of the dataset
    data_config=PodDataConfig(
        # Specify stypes for fields (optional)
        force_stypes={
            "enter-pod-name-for-system-id": {
                "categorical": ["target"],
                "image": ["file"],
            },
        },
        # Optional parameters
        modifiers=None,
        datasource_args={"seed": 100},
        data_split=DataSplitConfig(data_splitter="30,10", args={}),
    ),
)

Excel File

Bitfount supports Excel file sources in much the same way as we do CSV files. See below for examples and best practices.

Best Practices

  • If connecting an Excel file, avoid connecting one with multiple sheets you will wish to query across. If you have a multi-sheet Excel file, we suggest converting the file into a SQLite file database and using SQLite as the source instead.
  • Ensure your Excel file uses a supported encoding (this should be the case by default) and adheres to pandas' Excel-reading requirements. A quick way to check for multiple sheets is sketched below.
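
For example, a minimal sketch for checking how many sheets a workbook contains before deciding between an ExcelSource and a SQLite conversion; the file path is illustrative and an .xlsx-capable pandas engine such as openpyxl is assumed to be installed:

import pandas as pd

# List the sheets in the workbook; more than one may warrant converting to SQLite.
workbook = pd.ExcelFile("/path/to/your/workbook.xlsx")
print(workbook.sheet_names)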

YAML Configuration Example

Connecting an Excel file via YAML is comparable to connecting a CSV file. See below for an example configuration file:

pod_name: <enter-pod-name-for-system-id>
datasource: ExcelSource

pod_details:
  display_name: <Enter Name You'll See in the Hub>
  description: >
    This is a description of the data connected to this Pod with any relevant details for potential collaborators.
data_config:
  ignore_cols: ["Name", "DOB", "National Tax Number"]
  force_stypes:
    enter-your-pod-name:
      categorical: ["TARGET", "workclass", "marital-status", "occupation", "relationship", "race", "native-country", "gender", "education"]
  datasource_args:
    path: <PATH TO EXCEL>/<filename>.xlsx
    seed: 100 # optional
  data_split: 30,10 # optional

Bitfount Python API Configuration Example

Again, connecting an Excel file via the Python API is comparable to connecting a CSV file. See below for an example configuration:

from bitfount import ExcelSource, Pod
from bitfount.runners.config_schemas import (
    DataSplitConfig,
    PodDataConfig,
    PodDetailsConfig,
)

pod = Pod(
    name="enter-pod-name-for-system-id",
    datasource=ExcelSource("/PATH/OR_URL/TO/YOUR/EXCEL_FILE.xlsx"),
    pod_details_config=PodDetailsConfig(
        display_name="Hub Pod Display Name",
        description="This is a description of the data connected to this Pod with any relevant details for potential collaborators.",
    ),
    # Specify the structure of the dataset
    data_config=PodDataConfig(
        # Specify stypes for fields (optional)
        force_stypes={
            "enter-pod-name-for-system-id": {
                "categorical": ["target"],
                "image": ["file"],
            },
        },
        # Optional parameters
        modifiers=None,
        datasource_args={"seed": 100},
        data_split=DataSplitConfig(data_splitter="30,10", args={}),
    ),
)

Intermine

Intermine is a data warehouse system that allows dynamic querying of heterogeneous biological datasets.

Best Practices

  • Run pip install intermine prior to configuring the Pod.
  • An IntermineSource launches a Pod that can access all templates defined under a specified service (a sketch for listing them is shown after this list). Please see Intermine's tutorials for a detailed overview of their Python API: https://github.com/intermine/intermine-ws-python-docs.
  • Ensure you've been issued an admin token for connecting to your Intermine sources.
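
For example, a minimal sketch for listing the templates available under a service with the intermine client; the service URL and token are illustrative, and the templates attribute of Service is assumed to behave as described in the Intermine Python documentation linked above:

from intermine.webservice import Service

# Hypothetical service URL and token; replace with your own.
service = Service("https://YOUR_SERVICE_URL/service", token="<issued_token>")

# Templates are what an IntermineSource exposes, so check what is available.
for template_name in service.templates:
    print(template_name)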

YAML Configuration Example

If you'd prefer not to provide the token and service_url parameters in plain text, you can save them as environment variables in a separate file, like so:

TOKEN=<token>
SERVICE_URL=<service_url>

Then, you can create the Pod configuration file like so:

pod_name: <enter-pod-name-for-system-id>
datasource: IntermineSource
service_url: SERVICE_URL
token: TOKEN

pod_details:
  display_name: <Hub Display Name>
  description: >
    This Pod contains data from Intermine templates
data_config:
  force_stypes:
    <enter-your-pod-name>:
      categorical: [source.table.table.field_1, source.table.table.field_2]
  datasource_args:
    seed: 100 # optional
  data_split: 30,10 # optional

Bitfount Python API Configuration Example

from bitfount import IntermineSource, Pod
from bitfount.runners.config_schemas import (
    DataSplitConfig,
    PodDataConfig,
    PodDetailsConfig,
)

# Enter admin token:
token = "<issued_token>"

# Configure an Intermine Pod.
pod = Pod(
    name="enter-pod-name-for-system-id",
    datasource=IntermineSource(
        service_url="<https://YOUR_SERVICE_URL/>",
        token=token,
    ),
    pod_details_config=PodDetailsConfig(
        display_name="Hub Display Name",
        description="This Pod contains data from Intermine templates",
    ),
    data_config=PodDataConfig(
        modifiers=None,
        datasource_args={"seed": 100},
        data_split=DataSplitConfig(data_splitter="percentage1,percentage2", args={}),
        force_stypes={
            "enter-pod-name-for-system-id": {
                "categorical": [
                    "source.table.table.field_1",
                    "source.table.table.field_2",
                ],
            },
        },
    ),
)

SQLite

We recommend using SQLite sources if you want to connect an Excel file with multiple sheets/tables. Using SQLite will allow you to run tasks across tables (with the exception of PrivateSQL, which does not support joins) without needing to go through the more complex process of uploading your file contents to a separate PostgreSQL database server. A SQLite file is connected using the same DatabaseSource class as a hosted database.

Best Practices

  • Convert any Excel or .csv files to SQLite prior to configuring the Pod. We provide a Python script template for this below.
  • Ensure you know the accepted SQL field types prior to connecting the Pod (a quick way to inspect them is sketched after this list).
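
For example, a minimal sketch for inspecting the declared field types of a table in an existing SQLite file with Python's built-in sqlite3 module; the file and table names are illustrative:

import sqlite3

# Inspect the declared field types of a table in an existing SQLite file.
connection = sqlite3.connect("/path_to_file/final_db_name.db")
for column in connection.execute("PRAGMA table_info(sheet_name_1)"):
    # Each row is (cid, name, type, notnull, default_value, pk).
    print(column[1], column[2])
connection.close()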

Converting Excel or CSV to SQLite

Note, you can execute this directly or split it into cells in a notebook as desired.

import pandas as pd
import sqlite3

# Read each sheet from the Excel file into a DataFrame.
sheet_name_1 = pd.read_excel(
    '/PATH/TO/YOUR/EXCEL.xlsx',
    sheet_name='<sheet_name_1>',
    header=1)

sheet_name_2 = pd.read_excel(
    '/PATH/TO/YOUR/EXCEL.xlsx',
    sheet_name='<sheet_name_2>',
    header=1)

db_conn = sqlite3.connect("</path_to_file/final_db_name.db>")

c = db_conn.cursor()

# Create a table per sheet. Replace FIELD_TYPE with an accepted SQLite
# field type (e.g. TEXT, INTEGER, REAL) for each column.
# sheet_name_1
c.execute(
    """
    CREATE TABLE IF NOT EXISTS sheet_name_1 (
        field_1 FIELD_TYPE PRIMARY KEY,
        field_2 FIELD_TYPE,
        field_3 FIELD_TYPE
    );
    """
)

# sheet_name_2
c.execute(
    """
    CREATE TABLE IF NOT EXISTS sheet_name_2 (
        field_1 FIELD_TYPE PRIMARY KEY,
        field_2 FIELD_TYPE,
        field_3 FIELD_TYPE
    );
    """
)

# Append the sheet contents to the corresponding tables.
sheet_name_1.to_sql('sheet_name_1', db_conn, if_exists='append', index=False)
sheet_name_2.to_sql('sheet_name_2', db_conn, if_exists='append', index=False)
db_conn.commit()

# Test the connection by reading back a few rows.
result = pd.read_sql("SELECT * FROM sheet_name_1 LIMIT 5", db_conn)
print(result)
db_conn.close()

# Re-open the file to confirm the data persisted.
db_conn2 = sqlite3.connect("</path_to_file/final_db_name.db>")
res = pd.read_sql("SELECT * FROM sheet_name_1 LIMIT 5", db_conn2)
print(res)
db_conn2.close()

YAML Configuration Example

Connecting a SQLite database file to a Pod uses the same DatabaseSource as exemplified above. The only difference is that rather than pointing to a hosted database, the SQLite configuration uses the path to the local file:

pod:
  name: sqlite-table-database
  display_name: <sqlite pod name>
  description: >
    This pod provides access to data from tables in a sqlite file
  datasource: DatabaseSource
  data_config:
    datasource_args:
      db_conn: sqlite:///path/to/local/file.db
      table_names: [sheet_name_1, sheet_name_2]

Bitfount Python API Configuration Example

The SQLite data source requires importing additional classes from outside of the bitfount library, like so:

from bitfount import DatabaseSource, Pod
from bitfount.data.utils import DatabaseConnection
from bitfount.runners.config_schemas import PodDetailsConfig
from sqlalchemy import create_engine

pod = Pod(
    name="enter-pod-name-for-system-id",
    datasource=DatabaseSource(
        db_conn=DatabaseConnection(create_engine('sqlite:///path/to/local/file.db'))
    ),
    pod_details_config=PodDetailsConfig(
        display_name="Hub Display Name",
        description='This is a description of the data in my SQLite database file',
    ),
)

Image Sources

Images can be stored in any supported data source and configured as in the examples above. However, data sources must be configured to have an image reference column indicating the names of the image files.

Best Practices

  • If connecting image data via a database or other non-local source, be sure to use the force_stypes parameter and classify image columns as "image".
  • If connecting image data via files on your local machine, create a reference .csv file with a column listing the file names of the images you wish to connect to the Pod (a sketch for generating one is shown after this list). This column can be something as simple as:
image_file_name
0001.png
...

You can label this column however you wish, though you must be sure to reference it when generating the PodDataConfig for your DataSource.

  • If using a .csv source, we recommend you take note of the filepaths for both the .csv file and the images themselves.
  • Place all images in the same folder or cloud services bucket.
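
For example, a minimal sketch for generating a reference .csv from a local folder of images; the folder path, file extension, and column name are illustrative:

from pathlib import Path

import pandas as pd

# Folder containing the images you wish to connect to the Pod.
image_dir = Path("/path/to/your/images")

# One row per image file name; the column name is then referenced in
# force_stypes (e.g. {"image": ["image_file_name"]}) in the PodDataConfig.
file_names = sorted(p.name for p in image_dir.glob("*.png"))
reference = pd.DataFrame({"image_file_name": file_names})
reference.to_csv("/path/to/image_reference.csv", index=False)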

For a detailed example on connecting an image data Pod, see Tutorial 6.

FAQ

Don't see a data source you're hoping to use? See Using Custom Data Sources or reach out to us with your request!