
Import connector to AWS S3 and Google Cloud Storage

 
Datrics provides data connectors to AWS S3 and Google Cloud Storage for retrieving CSV and Parquet files. To retrieve the data, create a data source, create a dataset from it, and add the dataset to your pipeline.
Datrics reloads the data from the files stored in the cloud storage folder on each pipeline run. You may load the files from a specified folder or file path, or define a dynamic list of paths using Python code.
 

Create data source

There are two options to create a data source.
  • In the dataset tab in the group or project
  • In the dataset creation section on the scene
 
  1. Select the object storage from the list of connectors: AWS S3 or Google Cloud Storage.
Datrics connectors list
  2. Enter the credentials to connect to your bucket, test the connection, and save the data source. If you want to verify the credentials outside Datrics first, see the sketch below.
AWS S3 data source creation
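As an optional sanity check before entering the credentials in Datrics, you can confirm that they can read your bucket with a short script. This sketch is not part of Datrics; it assumes the AWS SDK for Python (boto3) is installed, and the bucket name is hypothetical.
import boto3

# Hypothetical values - replace with the credentials and bucket you will use in Datrics
s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
)

# List a single object to confirm the credentials can read the bucket
response = s3.list_objects_v2(Bucket="your-bucket-name", MaxKeys=1)
print(response.get("Contents", []))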
 

Create a dataset from files stored in the cloud storage

  1. Select the created data source to upload the data from, and specify the dataset name.
  2. Select the file type you would like to upload.
  3. Set up the file path.
There are two options to define the file path: static and dynamic.
In Static mode, you define a direct path to a file or folder in the bucket.
  • If you specify a direct file path, only that one file is uploaded.
  • If you specify a path to a folder, all files of the selected file type in that folder are combined into one dataset (a rough illustration of this behaviour follows below). CSV files must have the same structure. Please note that files from subfolders are not loaded.
Dataset creation from AWS S3 with static folder path
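For intuition only, the snippet below is a rough local illustration (not Datrics' implementation) of how a static folder path is resolved: it lists the objects directly under the prefix, skips subfolders, and concatenates same-structure CSV files into one table. The bucket and folder names are hypothetical, and it assumes boto3 and pandas are available with credentials configured locally.
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")     # assumes AWS credentials are configured locally
bucket = "my-bucket"        # hypothetical bucket name
prefix = "path/to/folder/"  # static folder path

# Delimiter="/" returns only objects directly under the prefix,
# so files in subfolders are skipped, mirroring the behaviour described above
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")

frames = []
for obj in response.get("Contents", []):
    if obj["Key"].endswith(".csv"):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        frames.append(pd.read_csv(io.BytesIO(body)))

# Same-structure CSV files are concatenated into a single dataset
dataset = pd.concat(frames, ignore_index=True)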
In Dynamic mode, you define a set of file or folder paths using Python code. For example, if you need to retrieve a file from a folder with today's date in its name, you can do that with a single line of code; on each pipeline run, the file for the current day is retrieved.
Validate the code to check that the list of file/folder paths is generated as expected.
Dataset creation from AWS S3 with dynamic folder path
Examples of Python code to define dynamic folder paths:
  1. List of folders with the specified dates in the name. Result: path/date=2023-01-04/, path/date=2023-01-03/, path/date=2023-01-02/
def generate() -> list:
    return ["path/date=2023-01-04/", "path/date=2023-01-03/", "path/date=2023-01-02/"]
  2. Path to the folder with the current date in the folder name. Result: path/date=2023-06-07/
def generate() -> list:
    import datetime
    return [f"path/date={datetime.date.today().strftime('%Y-%m-%d')}/"]
  3. List of folder paths for a date range. Result of the code below: path/date=2023-06-07/, path/date=2023-06-06/, path/date=2023-06-05/
def generate() -> list:
    import datetime
    dates = [(datetime.date.today() - datetime.timedelta(days=x)).strftime('%Y-%m-%d') + "/" for x in range(0, 3)]
    dates_formatted = [f"path/date={d}" for d in dates]
    return dates_formatted
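If you want to check the generated paths outside the Datrics validation step, a quick local check (assuming generate is defined as in the examples above) is simply to call the function and inspect the result:
# Quick local check of the dynamic-path function defined above
paths = generate()
print(paths)  # e.g. ['path/date=2023-06-07/', 'path/date=2023-06-06/', 'path/date=2023-06-05/']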
 

Using an object storage dataset in the pipeline

Add the dataset to the pipeline from the list of datasets.