Report of DVC task

This task was completed in the dvc branch.

In this task, a model is fitted to predict the label (“toxic”: 1, “non_toxic”: 0) of comments on the “Datashop” site. Two models were chosen for fitting: 1) Logistic Regression, 2) Catboost.

Pipeline

The pipeline consists of the following stages: 1) data_preprocessing, 2) splitting_data, 3) tf_idf fitting, 4) training the ML model, 5) testing the model.

All stages are described in dvc.yaml.

The model parameters are also initialized from params.yaml through dvc.api.
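As an illustration of reading parameters through dvc.api, here is a minimal sketch. It is not the project's actual code. It assumes params.yaml has one section per model; the section name log_reg and the keys C and max_iter are hypothetical. `params_show` is the DVC Python API call for reading tracked parameters, and it must run inside a DVC repository.

```python
from typing import Any, Mapping, Optional


def load_params(stage: Optional[str] = None) -> dict:
    """Read parameters via DVC; call this from inside a DVC repository."""
    from dvc import api  # imported lazily so model_kwargs() works without DVC

    # stages=... limits the result to the parameters used by that stage;
    # with stages=None the project's tracked parameters are returned.
    return api.params_show(stages=stage)


def model_kwargs(params: Mapping[str, Any], section: str) -> dict:
    """Pick the keyword arguments for one model from the loaded parameters."""
    if section not in params:
        raise KeyError(f"no '{section}' section in params")
    return dict(params[section])


# Hypothetical in-memory stand-in for the content of params.yaml
example = {"log_reg": {"C": 1.0, "max_iter": 200}}
print(model_kwargs(example, "log_reg"))
```

Keeping the DVC call separate from the dictionary lookup lets the lookup be checked without a repository at hand.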

DVC Storage Preparation

dvc init:

dvc init
dvc remote add -d --project gdrive gdrive://<URL>
dvc config core.autostage true

Pushing data to the DVC remote:

dvc add data\dvc_data\raw_data\data.csv
dvc push

Running DVC stages:

dvc repro
dvc repro <STAGE_NAME>
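If the stages need to be triggered from a script rather than the shell, the same command can be wrapped in Python. This is a small sketch and not part of the repository. The helper names are hypothetical, and the stage name passed in must match a stage defined in dvc.yaml.

```python
import subprocess
from typing import List, Optional


def repro_command(stage: Optional[str] = None) -> List[str]:
    """Build the argv for `dvc repro`, optionally limited to one stage."""
    command = ["dvc", "repro"]
    if stage:
        command.append(stage)
    return command


def run_repro(stage: Optional[str] = None) -> None:
    """Run the command; raises CalledProcessError if DVC exits with an error."""
    subprocess.run(repro_command(stage), check=True)
```

Calling `run_repro()` reproduces the whole pipeline. `run_repro("splitting_data")` would reproduce only that stage, assuming a stage with that name exists.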

DVC API

path = "dvc_artifacts/summary.json"
remote_name = "dvc_dz"
repo = "https://gitlab.com/ivan_golt/mlops_course/-/tree/dvc?ref_type=heads"
log_reg_commit = "fea1ad3e"
catboost_commit = "f55871eb"

import json

import pandas as pd
from dvc import api


def get_summary(commit: str, file: str = "/dvc_artifacts/summary.json") -> pd.DataFrame:
    """Read the metrics summary stored at the given Git revision into a DataFrame."""
    # Open the current repository at the requested commit
    fs = api.DVCFileSystem(rev=commit)
    data_json = fs.read_text(file, encoding="utf-8")
    return pd.DataFrame(json.loads(data_json))

Logistic Regression metrics and Catboost metrics

print("Logistic Regression model")
print(get_summary(commit=log_reg_commit))
print()
print("Catboost model")
print(get_summary(commit=catboost_commit))

Logistic Regression model
                     0           1  accuracy    macro avg  weighted avg
precision      0.95851     0.93242   0.95675      0.94547       0.95586
recall         0.99492     0.61944   0.95675      0.80718       0.95675
f1-score       0.97637     0.74437   0.95675      0.86037       0.95279
support    42909.00000  4856.00000   0.95675  47765.00000   47765.00000

Catboost model
                     0           1  accuracy    macro avg  weighted avg
precision      0.95166     0.78598   0.93968      0.86882       0.93482
recall         0.98278     0.55890   0.93968      0.77084       0.93968
f1-score       0.96697     0.65327   0.93968      0.81012       0.93508
support    42909.00000  4856.00000   0.93968  47765.00000   47765.00000
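To make the comparison between the two reports explicit, the toxic-class (label 1) metrics can be subtracted directly. The sketch below uses the values from the tables above. The nested-dictionary layout is an assumption about summary.json, inferred from the row and column names of the printed DataFrames, and `metric_gap` is a hypothetical helper.

```python
# Toxic-class metrics copied from the two reports above.
# Assumed layout: {label: {metric: value}}, as suggested by the printed tables.
log_reg = {"1": {"precision": 0.93242, "recall": 0.61944, "f1-score": 0.74437}}
catboost = {"1": {"precision": 0.78598, "recall": 0.55890, "f1-score": 0.65327}}


def metric_gap(first: dict, second: dict, label: str = "1") -> dict:
    """Return first-minus-second for each metric of one class, rounded to 5 places."""
    return {
        name: round(first[label][name] - second[label][name], 5)
        for name in first[label]
    }


gaps = metric_gap(log_reg, catboost)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.5f}")
```

On these numbers Logistic Regression is ahead on all three toxic-class metrics, with the largest gap in precision.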