# Pipeline (Snakemake + Hydra)

author: Ivan Milovidov

## Data

This pipeline uses the Titanic dataset, which contains information about the passengers on the Titanic.

The goal is to train a model to predict whether a Titanic passenger survived.

## Prepare data

The data is prepared in two versions, using different scaling (StandardScaler and RobustScaler).

## Models

The models trained within the Pipeline are:

GradientBoosting:

    _target_: sklearn.ensemble.GradientBoostingClassifier
    random_state: 42
    n_estimators: 500
    learning_rate: 0.01
    max_depth: 2

Random Forest:

    _target_: sklearn.ensemble.RandomForestClassifier
    n_estimators: 200
    max_depth: 200
    criterion: "entropy"
    min_samples_split: 2
    random_state: 42

Model parameters are taken from the configs using Hydra.

## Pipeline

As part of the task, the pipeline was run in a conda virtual environment containing scikit-learn and the other packages needed to train the ML models inside the Snakemake pipeline. The launch command:

    snakemake --cores 4

## DAG

To build the DAG, run:

    snakemake --dag | dot -Tsvg > pipeline.svg

## Pipeline stages

  1. load_data - loads the data from the kaggle folder
  2. prepare_data - preprocesses the dataset with two types of scaling (RobustScaler, StandardScaler)
  3. train_model - fits the GradientBoosting and RandomForest models on both variants of the input data
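
The order of the stages above can be sketched as a small dependency graph. The example below assumes each stage depends only on the one listed before it; Snakemake itself infers such dependencies from the input and output files of its rules rather than from an explicit mapping like this one, so treat it as an illustration only.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages whose outputs it needs (assumed here).
stage_dependencies = {
    "load_data": set(),
    "prepare_data": {"load_data"},
    "train_model": {"prepare_data"},
}

# A topological sort yields an order in which every stage runs after its inputs.
execution_order = list(TopologicalSorter(stage_dependencies).static_order())
print(execution_order)
```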

## Artifacts

The pipeline trains two models, each with two different variants of data preprocessing.
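
As a rough illustration of how two models and two preprocessing variants combine, the sketch below enumerates one output per pair, similar in spirit to Snakemake's `expand` helper. The scaler labels and the `models/..._....pkl` naming scheme are hypothetical and are not taken from the project.

```python
from itertools import product

models = ["GBT", "RNDF"]          # model identifiers used in the pipeline's code
scalers = ["standard", "robust"]  # hypothetical labels for the two data variants

# Hypothetical naming scheme: one fitted model per (model, scaler) pair.
artifact_paths = [
    f"models/{model}_{scaler}.pkl" for model, scaler in product(models, scalers)
]

for path in artifact_paths:
    print(path)
```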

## Pipeline code

Code

## Hydra

The pipeline uses Hydra, a framework for configuration. This project uses the Hydra Compose API (`workflows/scripts/prepare_data`).

At this stage, a function reads the parameters from a config:

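    from hydra import compose, initialize

    def return_config(config_path: str, config: str):
        """Compose and return the Hydra config named `config` from `config_path`."""
        with initialize(config_path=config_path):
            cfg = compose(config_name=config)
        return cfg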

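and the parameters from the config are then applied:

    from sklearn.impute import SimpleImputer

    # Load the imputer parameters and pass them as keyword arguments.
    simple_imputer_config = return_config(
        "../../config/preprocessing", config="simpleimputer.yaml"
    )
    imputer = SimpleImputer(**simple_imputer_config)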

Hydra's `instantiate` is also used (`workflows/scripts/train_model`) to initialise the model inside auxiliary Python modules.

For this, a function composes the config with optional overrides:

    from hydra import compose, initialize
    from omegaconf import DictConfig

    def compose_config(overrides: list[str] | None = None) -> DictConfig:
        """Compose the main config, applying optional parameter overrides.

        Args:
            overrides (list[str] | None, optional): params with override values

        Returns:
            DictConfig: config
        """

        with initialize(config_path="../../config"):
            hydra_config = compose(config_name="config", overrides=overrides)
        return hydra_config

and an example of using it to create the model in code:

    if model_type in ("GBT", "RNDF"):
        # Select the model config group by name, then build the estimator from it.
        cfg = compose_config(overrides=[f"model='{model_type}'"])
        model = instantiate(cfg["model"])
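
To show the idea behind building an object from a config with a `_target_` key, here is a minimal standard-library sketch. The function `instantiate_target` is a hypothetical, simplified stand-in and not Hydra's implementation (the real `instantiate` supports more, such as nested configs), and a standard-library class is used as the target so the example runs without scikit-learn.

```python
import importlib


def instantiate_target(config: dict):
    """Import the class named by `_target_` and call it with the other keys."""
    params = dict(config)  # copy, so the caller's config is left unchanged
    module_path, _, class_name = params.pop("_target_").rpartition(".")
    target_class = getattr(importlib.import_module(module_path), class_name)
    return target_class(**params)


# Stand-in config: a standard-library class replaces the scikit-learn models.
example_config = {"_target_": "datetime.timedelta", "days": 1, "hours": 12}
example_object = instantiate_target(example_config)
print(example_object)
```

In the same way, a config whose `_target_` is `sklearn.ensemble.RandomForestClassifier` would have its remaining keys passed to that class as keyword arguments.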