path = "dvc_artifacts/summary.json"
remote_name = "dvc_dz"
repo = "https://gitlab.com/ivan_golt/mlops_course/-/tree/dvc?ref_type=heads"
log_reg_commit = "fea1ad3e"
catboost_commit = "f55871eb"
Report of DVC task
This task was completed in the dvc branch.
The goal is to fit a model that predicts the label (“toxic”: 1, “non_toxic”: 0) of comments on the “Datashop” site. Two models were chosen for fitting: 1) Logistic Regression, 2) Catboost.
Pipeline
The pipeline consists of the following stages: 1) data_preprocessing 2) splitting_data 3) tf_idf fitting 4) training the ML model 5) testing the model
All stages are described in dvc.yaml.
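As a rough illustration of how the order of the stages follows from their declared dependencies and outputs, the sketch below models the five stages in plain Python. The file names under deps and outs are hypothetical placeholders, not the contents of the actual dvc.yaml, and the ordering logic only imitates the idea of a dependency graph; it is not DVC's own implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical deps/outs for the five stages; the real ones live in dvc.yaml.
STAGES = {
    "data_preprocessing": {"deps": ["data.csv"], "outs": ["clean.csv"]},
    "splitting_data": {"deps": ["clean.csv"], "outs": ["train.csv", "test.csv"]},
    "tf_idf": {"deps": ["train.csv"], "outs": ["tfidf.pkl"]},
    "training": {"deps": ["train.csv", "tfidf.pkl"], "outs": ["model.pkl"]},
    "testing": {"deps": ["test.csv", "tfidf.pkl", "model.pkl"], "outs": ["summary.json"]},
}


def execution_order(stages):
    """Order stages so that every stage runs after the stages producing its inputs."""
    # Map each output file to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage depends on the producers of its deps; raw inputs have no producer.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(STAGES))
```

With these placeholder files the stages form a single chain, so the order matches the list of stages given above.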
Model initialization parameters are read from params.yaml through dvc.api.
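A minimal sketch of reading model parameters this way is shown below. It assumes the dvc package is installed and that the code runs inside the DVC repository; the section names log_reg and catboost and the values in the example dict are hypothetical, since the actual params.yaml is not shown here.

```python
from typing import Any, Dict, Optional


def load_params(rev: Optional[str] = None) -> Dict[str, Any]:
    """Read params.yaml through the DVC API (needs dvc and a DVC repository)."""
    import dvc.api  # imported lazily so the helper below also works without DVC

    return dvc.api.params_show(rev=rev)


def model_kwargs(params: Dict[str, Any], section: str) -> Dict[str, Any]:
    """Return the constructor arguments stored under one top-level section."""
    return dict(params.get(section, {}))


# Hypothetical params.yaml content, written as the dict params_show would return.
example = {"log_reg": {"C": 1.0, "max_iter": 1000}, "catboost": {"iterations": 500}}
print(model_kwargs(example, "log_reg"))
```

In a training stage the result could be passed on as keyword arguments, for example LogisticRegression(**model_kwargs(load_params(), "log_reg")), again assuming that section name.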
DVC Storage Preparation
DVC initialization:
dvc init
dvc remote add -d --project gdrive gdrive://<URL>
dvc config core.autostage true
Sending data to the DVC remote:
dvc add data\dvc_data\raw_data\data.csv
dvc push
Running DVC stages (the whole pipeline, or a single stage by name):
dvc repro
dvc repro <STAGE_NAME>
DVC API
import json
import pandas as pd
from dvc import api
def get_summary(commit: str, file: str = "/dvc_artifacts/summary.json") -> pd.DataFrame:
    # Open the repository at the given commit and read the tracked summary file.
    fs = api.DVCFileSystem(rev=commit)
    data_json = fs.read_text(file, encoding="utf-8")
    data = json.loads(data_json)
    data_flatten = data
    return pd.DataFrame(data_flatten)
Logistic Regression metrics and Catboost metrics
print("Logistic Regression model")
print(get_summary(commit=log_reg_commit))
print()
print("Catboost model")
print(get_summary(commit=catboost_commit))
Logistic Regression model
                     0           1  accuracy    macro avg  weighted avg
precision      0.95851     0.93242   0.95675      0.94547       0.95586
recall         0.99492     0.61944   0.95675      0.80718       0.95675
f1-score       0.97637     0.74437   0.95675      0.86037       0.95279
support    42909.00000  4856.00000   0.95675  47765.00000   47765.00000
Catboost model
                     0           1  accuracy    macro avg  weighted avg
precision      0.95166     0.78598   0.93968      0.86882       0.93482
recall         0.98278     0.55890   0.93968      0.77084       0.93968
f1-score       0.96697     0.65327   0.93968      0.81012       0.93508
support    42909.00000  4856.00000   0.93968  47765.00000   47765.00000
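To make the two reports easier to compare, the sketch below recomputes the F1 score of the toxic class (label 1) from the reported precision and recall and prints the gap between the models. The numbers are copied from the reports above; the helper name f1_score is introduced here only for illustration.

```python
# Class "1" (toxic) metrics copied from the two reports above.
REPORTED = {
    "Logistic Regression": {"precision": 0.93242, "recall": 0.61944, "f1": 0.74437},
    "Catboost": {"precision": 0.78598, "recall": 0.55890, "f1": 0.65327},
}


def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, metrics in REPORTED.items():
    recomputed = f1_score(metrics["precision"], metrics["recall"])
    print(f"{name}: reported F1={metrics['f1']:.5f}, recomputed F1={recomputed:.5f}")

# Difference in toxic-class F1 between the two models.
gap = REPORTED["Logistic Regression"]["f1"] - REPORTED["Catboost"]["f1"]
print(f"F1 gap (Logistic Regression minus Catboost): {gap:.4f}")
```

The recomputed values agree with the reported ones up to rounding, and on these runs Logistic Regression scores higher than Catboost on the toxic class.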