Datrics provides data connectors to AWS S3 and Google Cloud Storage to retrieve CSV and Parquet files. To retrieve the data, create a data source, create a dataset, and add it to your pipeline.
On each pipeline run, Datrics reloads the data from the files stored in the specified location in the cloud storage. You may load files from a fixed folder or file path, or define a dynamic list of paths using Python code.
Create data source
There are two places to create a data source:
- In the dataset tab in the group or project
- In the dataset creation section on the scene
- Select the object storage from the list of connectors - AWS S3 or Google Cloud Storage.
- Enter the credentials to connect to your bucket, test the connection, and save the data source.
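The exact credential fields depend on the connector; for AWS S3 they typically include an access key ID and secret access key, and for Google Cloud Storage a service-account key. As a rough illustration only (not the Datrics implementation), testing an S3 connection boils down to listing the bucket with the provider's SDK. The sketch below assumes boto3 and placeholder credentials:

```python
# Hypothetical sketch of what "test the connection" amounts to for S3.
# Datrics performs this check for you; the snippet only illustrates the idea.
import boto3
from botocore.exceptions import ClientError

def test_s3_connection(access_key: str, secret_key: str, bucket: str) -> bool:
    client = boto3.client(
        "s3",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    try:
        # Listing a single object is enough to verify credentials and bucket access.
        client.list_objects_v2(Bucket=bucket, MaxKeys=1)
        return True
    except ClientError:
        return False
```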
Create a dataset from files stored in the cloud storage
- Select the created data source to upload the data and specify the dataset name
- Select the file type you would like to upload
- Set up the file path
There are two options to define the file path: static and dynamic.
In Static mode, you may define the direct path to the file or folder in the bucket.
- If you specify a direct file path, only one file is uploaded.
- If you specify a path to a folder, all files of the selected file type are combined into one dataset. CSV files must have the same structure. Please note that files from subfolders are not loaded (see the illustrative sketch below).
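Conceptually, pointing a dataset at a folder means reading every top-level file of the chosen type and concatenating them into a single table. The sketch below only illustrates that behavior, assuming pandas and s3fs and a hypothetical bucket path; it is not how Datrics is implemented:

```python
# Illustrative only: combine all top-level CSV files from an S3 "folder"
# into one dataset, skipping subfolders. Assumes pandas and s3fs are installed,
# credentials are configured, and the bucket/prefix are hypothetical.
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()
prefix = "my-bucket/path/"  # hypothetical folder path

# fs.ls lists only the immediate children of the prefix, so files in
# subfolders are not picked up, mirroring the behavior described above.
csv_files = [p for p in fs.ls(prefix) if p.endswith(".csv")]

# All CSV files are expected to share the same structure (columns).
dataset = pd.concat((pd.read_csv(fs.open(p)) for p in csv_files), ignore_index=True)
```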
In Dynamic mode, you may define the set of files or folders using Python code. For example, if you need to retrieve a file from the folder with today's date in its name, you can do that with a few lines of code; the pipeline then retrieves a new file each day.
Validate the code to check that the list of file/folder paths is generated as expected.
Examples of Python code to define dynamic paths to folders:
- A list of folders with the specified dates in the name. Result: path/date=2023-01-04/, path/date=2023-01-03/, path/date=2023-01-02/
```python
def generate() -> list:
    return ["path/date=2023-01-04/", "path/date=2023-01-03/", "path/date=2023-01-02/"]
```
- Path to the folder with the current date in the folder name. Result: path/date=2023-06-07/
```python
def generate() -> list:
    import datetime
    return [f"path/date={datetime.date.today().strftime('%Y-%m-%d')}/"]
```
- A list of folder paths for a date range. Result of the code below: path/date=2023-06-07/, path/date=2023-06-06/, path/date=2023-06-05/
```python
def generate() -> list:
    import datetime
    dates = [(datetime.date.today() - datetime.timedelta(days=x)).strftime('%Y-%m-%d') + "/" for x in range(0, 3)]
    dates_formatted = [f"path/date={d}" for d in dates]
    return dates_formatted
```
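To double-check a generate() function before saving it, you can also run it in any local Python session and inspect the list it returns. A minimal example, reusing the current-date snippet above:

```python
import datetime

# The generate() function from the current-date example above.
def generate() -> list:
    return [f"path/date={datetime.date.today().strftime('%Y-%m-%d')}/"]

# Run it locally and inspect the output before pasting it into Datrics.
paths = generate()
print(paths)  # e.g. ['path/date=2023-06-07/']
```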
Using an object storage dataset in the pipeline
Add the dataset to the pipeline from the list of datasets.