ml

Utility functions that can be used for ML jobs and Kaggle.

Reference for the Kaggle API: https://github.com/Kaggle/kaggle-api

Working with datasets


are_features_consistent

 are_features_consistent (df1:pandas.core.frame.DataFrame,
                          df2:pandas.core.frame.DataFrame,
                          dependent_variables:list[str]=None,
                          raise_error:bool=False)

Verify that features/columns in training and test sets are consistent

Parameter            Type          Default  Details
df1                  pd.DataFrame           First set, typically the training set
df2                  pd.DataFrame           Second set, typically the test set or inference set
dependent_variables  list[str]     None     List of column name(s) for dependent variables
raise_error          bool          False    True to raise an error if not consistent
Returns              bool                   True if features in train and test datasets are consistent, False otherwise

Training set and test set should have the same features/columns, except for the dependent variable(s). This function tests whether this is the case.

import numpy as np
import pandas as pd

feats = [f"Feature_{i:02d}" for i in range(10)]
X_train = pd.DataFrame(np.random.normal(size=(500, 10)), columns=feats)
X_test = pd.DataFrame(np.random.normal(size=(100, 10)), columns=feats)
X_test_not_consistant = X_test.iloc[:, 2:]
display(X_train.head(3))
display(X_test.head(3))
display(X_test_not_consistant.head(3))
Feature_00 Feature_01 Feature_02 Feature_03 Feature_04 Feature_05 Feature_06 Feature_07 Feature_08 Feature_09
0 1.394439 0.266156 -0.070705 -0.462835 0.025394 0.361311 0.801035 0.205413 0.941988 2.868571
1 0.740853 -1.390509 -1.583919 -1.951328 -0.739606 0.775896 -0.060068 0.121640 0.864439 1.192721
2 0.526661 0.233771 1.028485 0.284115 -0.448474 0.512852 -0.673979 0.426295 -0.181841 0.455442
Feature_00 Feature_01 Feature_02 Feature_03 Feature_04 Feature_05 Feature_06 Feature_07 Feature_08 Feature_09
0 -1.612301 -0.659610 -0.553156 0.477722 0.498676 -2.585540 1.329870 -1.638286 -0.248535 -1.322088
1 0.857624 1.224392 0.115925 -0.055684 -1.336148 3.651585 0.532247 -1.325887 -0.616351 -1.350044
2 0.381214 -0.024726 0.853689 0.270990 -0.571249 -0.117136 -1.895106 -0.176482 -0.331920 0.671925
Feature_02 Feature_03 Feature_04 Feature_05 Feature_06 Feature_07 Feature_08 Feature_09
0 -0.553156 0.477722 0.498676 -2.585540 1.329870 -1.638286 -0.248535 -1.322088
1 0.115925 -0.055684 -1.336148 3.651585 0.532247 -1.325887 -0.616351 -1.350044
2 0.853689 0.270990 -0.571249 -0.117136 -1.895106 -0.176482 -0.331920 0.671925

Compare all the features/columns

are_features_consistent(X_train, X_test)
True
are_features_consistent(X_train, X_test_not_consistant)
False

are_features_consistent(X_train, X_test_not_consistant, raise_error=True) should raise an error instead of returning False

from fastcore.test import test_fail

test_fail(
    f=are_features_consistent,
    args=(X_train, X_test_not_consistant),
    kwargs={'raise_error': True},
    contains="Discrepancy between training and test feature set:",
    msg="Should raise a ValueError"
)

When comparing a training set and an inference set, the training set has more features because it includes the dependent variables. To test the consistency of the datasets, specify which columns are dependent variables.

For instance, X_train has all features, including the two dependent variables Feature_08 and Feature_09.

X_inference = X_train.iloc[:, :-2]
display(X_train.head(3))
display(X_inference.head(3))
Feature_00 Feature_01 Feature_02 Feature_03 Feature_04 Feature_05 Feature_06 Feature_07 Feature_08 Feature_09
0 1.394439 0.266156 -0.070705 -0.462835 0.025394 0.361311 0.801035 0.205413 0.941988 2.868571
1 0.740853 -1.390509 -1.583919 -1.951328 -0.739606 0.775896 -0.060068 0.121640 0.864439 1.192721
2 0.526661 0.233771 1.028485 0.284115 -0.448474 0.512852 -0.673979 0.426295 -0.181841 0.455442
Feature_00 Feature_01 Feature_02 Feature_03 Feature_04 Feature_05 Feature_06 Feature_07
0 1.394439 0.266156 -0.070705 -0.462835 0.025394 0.361311 0.801035 0.205413
1 0.740853 -1.390509 -1.583919 -1.951328 -0.739606 0.775896 -0.060068 0.121640
2 0.526661 0.233771 1.028485 0.284115 -0.448474 0.512852 -0.673979 0.426295
are_features_consistent(X_train, X_inference, dependent_variables=['Feature_08', 'Feature_09'])
True
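
The behaviour shown above can be approximated with plain set operations on the column names. The block below is a minimal sketch, not the library's implementation: the name check_feature_consistency is hypothetical, and the error text only mirrors the message prefix used in the test_fail example.

```python
import pandas as pd


def check_feature_consistency(df1, df2, dependent_variables=None, raise_error=False):
    """Hypothetical re-implementation, for illustration only."""
    # Ignore the dependent variable(s) on both sides before comparing
    ignored = set(dependent_variables or [])
    cols1 = set(df1.columns) - ignored
    cols2 = set(df2.columns) - ignored
    # Columns present in exactly one of the two sets
    discrepancy = cols1 ^ cols2
    if discrepancy and raise_error:
        raise ValueError(
            f"Discrepancy between training and test feature set: {sorted(discrepancy)}"
        )
    return not discrepancy


feats = [f"Feature_{i:02d}" for i in range(10)]
train = pd.DataFrame([[0.0] * 10], columns=feats)
inference = train.iloc[:, :-2]  # drops Feature_08 and Feature_09

print(check_feature_consistency(train, train.copy()))   # True
print(check_feature_consistency(train, inference))      # False
print(check_feature_consistency(train, inference, dependent_variables=feats[-2:]))  # True
```

Comparing sets rather than lists makes the check independent of column order, which may or may not match what are_features_consistent does.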

Kaggle


kaggle_setup_colab

 kaggle_setup_colab (path_to_config_file:pathlib.Path|str=None)

Update kaggle API and create security key json file from config file on Google Drive

Parameter            Type        Default  Details
path_to_config_file  Path | str  None     Path to the configuration file (e.g. config.cfg)

Technical Background

References: Kaggle API documentation

The Kaggle API token must be placed as a JSON file at the following location (Linux/macOS and Windows, respectively):

    ~/.kaggle/kaggle.json
    %HOMEPATH%\.kaggle\kaggle.json

To access Kaggle through the API, a security key must be placed in the correct location on Colab.

The config.cfg file must include the following lines:

    [kaggle]
    kaggle_username = kaggle_user_name
    kaggle_key = API key provided by kaggle

For information on how to get an API key (kaggle.json), see the Kaggle API documentation: https://github.com/Kaggle/kaggle-api
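
The step from config.cfg to kaggle.json can be sketched with the standard library alone. This is an illustration under assumptions, not the code of kaggle_setup_colab: the helper name write_kaggle_json and its kaggle_dir parameter are hypothetical, and the function only covers writing the key file, not updating the Kaggle package.

```python
import configparser
import json
import os
from pathlib import Path


def write_kaggle_json(path_to_config_file, kaggle_dir=None):
    """Hypothetical helper: build kaggle.json from the [kaggle] section of a config file."""
    config = configparser.ConfigParser()
    # ConfigParser.read returns the list of files it managed to parse
    if not config.read(path_to_config_file):
        raise FileNotFoundError(f"Cannot read config file: {path_to_config_file}")

    # kaggle.json holds exactly these two keys
    credentials = {
        "username": config["kaggle"]["kaggle_username"],
        "key": config["kaggle"]["kaggle_key"],
    }

    # Default to ~/.kaggle, the location given above (kaggle_dir is an assumption
    # added so the sketch can be tried without touching the real home directory)
    kaggle_dir = Path(kaggle_dir) if kaggle_dir else Path.home() / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    target = kaggle_dir / "kaggle.json"
    target.write_text(json.dumps(credentials))
    # Restrict the key file to its owner
    os.chmod(target, 0o600)
    return target
```

Calling write_kaggle_json("config.cfg") would then write ~/.kaggle/kaggle.json, provided the config file contains the [kaggle] section shown above.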


kaggle_list_files

 kaggle_list_files (code:str=None, mode:str='competitions')

List all files available in the competition or dataset for the passed code

Parameter  Type  Default       Details
code       str   None          Code for the kaggle competition or dataset
mode       str   competitions  Mode: competitions or datasets
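
For orientation, the two modes correspond to the kaggle command-line calls "kaggle competitions files" and "kaggle datasets files". The sketch below builds and runs such a command. It is an assumption about how the listing could be done, not necessarily how kaggle_list_files works internally. The helper names are hypothetical and the codes used are example values.

```python
import subprocess


def build_list_files_command(code, mode="competitions"):
    """Hypothetical helper: build the kaggle CLI call that lists files."""
    if mode not in ("competitions", "datasets"):
        raise ValueError(f"mode must be 'competitions' or 'datasets', got {mode!r}")
    if not code:
        raise ValueError("code must be a competition or dataset identifier")
    return ["kaggle", mode, "files", code]


def list_files(code, mode="competitions"):
    """Run the command and return the raw CLI output.

    Requires the kaggle package to be installed and kaggle.json to be in place.
    """
    result = subprocess.run(
        build_list_files_command(code, mode),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


print(build_list_files_command("titanic"))
# ['kaggle', 'competitions', 'files', 'titanic']
print(build_list_files_command("zillow/zecon", mode="datasets"))
# ['kaggle', 'datasets', 'files', 'zillow/zecon']
```

Building the argument list separately from running it keeps the validation of code and mode checkable without network access or credentials.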

kaggle_download_competition_files

 kaggle_download_competition_files (competition_code:str=None,
                                    train_files:[]=[], test_files:list=[],
                                    submit_files:list=[],
                                    project_folder:str='ds')

Download all files for the passed competition, unzip them if required, and move them to the train, test and submit folders.

competition_code: str, code of the kaggle competition
train_files: list of str, names of files to be moved into the train folder
test_files: list of str, names of files to be moved into the test folder
submit_files: list of str, names of files to be moved into the submit folder
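
The unzip-and-sort step described above can be illustrated with zipfile and shutil. This is a sketch under assumptions, not the function's actual code: sort_competition_files is a hypothetical name, the download itself is left out, and the train/test/submit folders are assumed to sit directly under the project folder.

```python
import shutil
import zipfile
from pathlib import Path


def sort_competition_files(download_dir, project_folder,
                           train_files=(), test_files=(), submit_files=()):
    """Hypothetical helper: unzip archives, then move named files into train/test/submit."""
    download_dir = Path(download_dir)
    # Extract every archive found in the download folder, in place
    for archive in list(download_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(download_dir)

    moved = {}
    groups = {"train": train_files, "test": test_files, "submit": submit_files}
    for folder, names in groups.items():
        target = Path(project_folder) / folder
        target.mkdir(parents=True, exist_ok=True)
        for name in names:
            source = download_dir / name
            # Skip names that were not part of the download
            if source.is_file():
                shutil.move(str(source), str(target / name))
                moved[name] = folder
    return moved
```

The sketch uses empty tuples as defaults instead of the [] defaults in the signature above, which avoids sharing one mutable list between calls.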

Others


fastbook_on_colab

 fastbook_on_colab ()

Set up the environment to run fastbook notebooks on Colab

Code extracted from fastbook notebook:

# Install fastbook and dependencies
!pip install -Uqq fastbook

# Load utilities and install them
!wget -O utils.py https://raw.githubusercontent.com/vtecftwy/fastbook/walk-thru/utils.py
!wget -O fastbook_utils.py https://raw.githubusercontent.com/vtecftwy/fastbook/walk-thru/fastbook_utils.py

from fastbook_utils import *
from utils import *

# Setup My Drive
setup_book()

# Download images and code required for this notebook
import os
os.makedirs('images', exist_ok=True)
!wget -O images/chapter1_cat_example.jpg https://raw.githubusercontent.com/vtecftwy/fastai-course-v4/master/nbs/images/chapter1_cat_example.jpg
!wget -O images/cat-01.jpg https://raw.githubusercontent.com/vtecftwy/fastai-course-v4/walk-thru/nbs/images/cat-01.jpg
!wget -O images/cat-02.jpg https://raw.githubusercontent.com/vtecftwy/fastai-course-v4/walk-thru/nbs/images/cat-02.jpg
!wget -O images/dog-01.jpg https://raw.githubusercontent.com/vtecftwy/fastai-course-v4/walk-thru/nbs/images/dog-01.jpg
!wget -O images/dog-02.jpg https://raw.githubusercontent.com/vtecftwy/fastai-course-v4/walk-thru/nbs/images/dog-02.jpg