`ml`

Utility functions for ML jobs and Kaggle.

Reference for kaggle API: https://github.com/Kaggle/kaggle-api

Working with datasets

are_features_consistent

 are_features_consistent (df1:pandas.core.frame.DataFrame,
                          df2:pandas.core.frame.DataFrame,
                          dependent_variables:list[str]=None,
                          raise_error:bool=False)

Verify that features/columns in training and test sets are consistent

	Type	Default	Details
df1	pd.DataFrame		First set, typically the training set
df2	pd.DataFrame		Second set, typically the test set or inference set
dependent_variables	list[str]	None	List of column name(s) for dependent variables
raise_error	bool	False	True to raise an error if not consistent
Returns	bool		True if features in train and test datasets are consistent, False otherwise

Training set and test set should have the same features/columns, except for the dependent variable(s). This function tests whether this is the case.

feats = [f"Feature_{i:02d}" for i in range(10)]
X_train = pd.DataFrame(np.random.normal(size=(500, 10)), columns=feats)
X_test = pd.DataFrame(np.random.normal(size=(100, 10)), columns=feats)
X_test_not_consistant = X_test.iloc[:, 2:]
display(X_train.head(3))
display(X_test.head(3))
display(X_test_not_consistant.head(3))

	Feature_00	Feature_01	Feature_02	Feature_03	Feature_04	Feature_05	Feature_06	Feature_07	Feature_08	Feature_09
0	1.394439	0.266156	-0.070705	-0.462835	0.025394	0.361311	0.801035	0.205413	0.941988	2.868571
1	0.740853	-1.390509	-1.583919	-1.951328	-0.739606	0.775896	-0.060068	0.121640	0.864439	1.192721
2	0.526661	0.233771	1.028485	0.284115	-0.448474	0.512852	-0.673979	0.426295	-0.181841	0.455442

	Feature_00	Feature_01	Feature_02	Feature_03	Feature_04	Feature_05	Feature_06	Feature_07	Feature_08	Feature_09
0	-1.612301	-0.659610	-0.553156	0.477722	0.498676	-2.585540	1.329870	-1.638286	-0.248535	-1.322088
1	0.857624	1.224392	0.115925	-0.055684	-1.336148	3.651585	0.532247	-1.325887	-0.616351	-1.350044
2	0.381214	-0.024726	0.853689	0.270990	-0.571249	-0.117136	-1.895106	-0.176482	-0.331920	0.671925

	Feature_02	Feature_03	Feature_04	Feature_05	Feature_06	Feature_07	Feature_08	Feature_09
0	-0.553156	0.477722	0.498676	-2.585540	1.329870	-1.638286	-0.248535	-1.322088
1	0.115925	-0.055684	-1.336148	3.651585	0.532247	-1.325887	-0.616351	-1.350044
2	0.853689	0.270990	-0.571249	-0.117136	-1.895106	-0.176482	-0.331920	0.671925

Compare all the features/columns

are_features_consistent(X_train, X_test)

True

are_features_consistent(X_train, X_test_not_consistant)

False

are_features_consistent(X_train, X_test_not_consistant, raise_error=True) should raise an error instead of returning False

test_fail(
    f=are_features_consistent, 
    args=(X_train, X_test_not_consistant),
    kwargs = {'raise_error':True},
    contains="Discrepancy between training and test feature set:",
    msg="Should raise a ValueError"
)

When comparing a training set and an inference set, the training set has more features because it includes the dependent variables. To test the consistency of the datasets, specify which columns are dependent variables.

For instance, X_train has all features, including the two dependent variables Feature_08 and Feature_09.

X_inference = X_train.iloc[:, :-2]
display(X_train.head(3))
display(X_inference.head(3))

	Feature_00	Feature_01	Feature_02	Feature_03	Feature_04	Feature_05	Feature_06	Feature_07	Feature_08	Feature_09
0	1.394439	0.266156	-0.070705	-0.462835	0.025394	0.361311	0.801035	0.205413	0.941988	2.868571
1	0.740853	-1.390509	-1.583919	-1.951328	-0.739606	0.775896	-0.060068	0.121640	0.864439	1.192721
2	0.526661	0.233771	1.028485	0.284115	-0.448474	0.512852	-0.673979	0.426295	-0.181841	0.455442

	Feature_00	Feature_01	Feature_02	Feature_03	Feature_04	Feature_05	Feature_06	Feature_07
0	1.394439	0.266156	-0.070705	-0.462835	0.025394	0.361311	0.801035	0.205413
1	0.740853	-1.390509	-1.583919	-1.951328	-0.739606	0.775896	-0.060068	0.121640
2	0.526661	0.233771	1.028485	0.284115	-0.448474	0.512852	-0.673979	0.426295

are_features_consistent(X_train, X_inference, dependent_variables=['Feature_08', 'Feature_09'])

True
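The check described above amounts to a set comparison on column names once the dependent variables are ignored. The following is a minimal sketch of that idea, not the library's actual implementation: the function name `columns_are_consistent` is hypothetical, it works on plain lists of column names rather than DataFrames, and the error message only mirrors the prefix expected by the `test_fail` call earlier in this section.

```python
from typing import Iterable, Optional


def columns_are_consistent(
    cols1: Iterable[str],
    cols2: Iterable[str],
    dependent_variables: Optional[Iterable[str]] = None,
    raise_error: bool = False,
) -> bool:
    """Hypothetical re-implementation of the check on plain column names."""
    # Dependent variables are excluded from both sides before comparing
    ignored = set(dependent_variables or [])
    feats1 = set(cols1) - ignored
    feats2 = set(cols2) - ignored
    # Symmetric difference: columns present in only one of the two sets
    discrepancy = feats1 ^ feats2
    if discrepancy and raise_error:
        # Message prefix assumed from the test_fail example in this section
        raise ValueError(
            f"Discrepancy between training and test feature set: {sorted(discrepancy)}"
        )
    return not discrepancy


feats = [f"Feature_{i:02d}" for i in range(10)]
print(columns_are_consistent(feats, feats))
print(columns_are_consistent(feats, feats[2:]))
print(columns_are_consistent(feats, feats[:-2], dependent_variables=feats[-2:]))
```

With DataFrames, the same sketch could be called with `df1.columns` and `df2.columns`. Column order is deliberately ignored here; whether the real function also ignores order is not stated in this page.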

Kaggle


kaggle_setup_colab

 kaggle_setup_colab (path_to_config_file:pathlib.Path|str=None)

Update kaggle API and create security key json file from config file on Google Drive

	Type	Default	Details
path_to_config_file	Path | str	None	path to the configuration file (e.g. config.cfg)

Technical Background

References: Kaggle API documentation

The Kaggle API token must be placed as a JSON file at one of the following locations (Linux/macOS and Windows, respectively):

    ~/.kaggle/kaggle.json
    %HOMEPATH%\.kaggle\kaggle.json

To access Kaggle through the API, a security key must be placed in the correct location on Colab.

The config.cfg file must include the following lines:

    [kaggle]
    kaggle_username = kaggle_user_name
    kaggle_key = API key provided by kaggle

For information on how to get an API key (kaggle.json), see the Kaggle API documentation: https://github.com/Kaggle/kaggle-api
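The conversion from config.cfg to kaggle.json can be sketched with the standard library alone. This is an illustration under assumptions, not the code of `kaggle_setup_colab`: the helper name `write_kaggle_json` and its `kaggle_dir` parameter are hypothetical, and the sketch does not update the kaggle package or mount Google Drive. It relies on kaggle.json holding a `username` and a `key` entry.

```python
import configparser
import json
import os
from pathlib import Path
from typing import Optional, Union


def write_kaggle_json(
    path_to_config_file: Union[Path, str],
    kaggle_dir: Optional[Path] = None,
) -> Path:
    """Hypothetical helper: build kaggle.json from the [kaggle] section of a config file."""
    config_path = Path(path_to_config_file)
    if not config_path.is_file():
        raise FileNotFoundError(f"Config file not found: {config_path}")

    # interpolation=None so that a '%' in the key is read literally
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read(config_path)
    token = {
        "username": cfg["kaggle"]["kaggle_username"],
        "key": cfg["kaggle"]["kaggle_key"],
    }

    # Default to ~/.kaggle, the location listed above for Linux/macOS
    kaggle_dir = Path(kaggle_dir) if kaggle_dir is not None else Path.home() / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    target = kaggle_dir / "kaggle.json"
    target.write_text(json.dumps(token))

    # Restrict the file to the current user (has little effect on Windows)
    os.chmod(target, 0o600)
    return target
```

Passing an explicit `kaggle_dir` makes it possible to try the sketch in a scratch folder without overwriting an existing token.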


kaggle_list_files

 kaggle_list_files (code:str=None, mode:str='competitions')

List all files available in the competition or dataset for the passed code

	Type	Default	Details
code	str	None	code for the kaggle competition or dataset
mode	str	competitions	mode: `competitions` or `datasets`
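The two modes correspond to the `competitions` and `datasets` subcommands of the kaggle command line tool. The sketch below only builds the command that would list files for a given code; it is an assumption about how such a wrapper could be organised, not the implementation of `kaggle_list_files`. The helper name `kaggle_files_command` is hypothetical, and the competition and dataset codes in the example are placeholders.

```python
from typing import List


def kaggle_files_command(code: str, mode: str = "competitions") -> List[str]:
    """Hypothetical helper: build the kaggle CLI command that lists files."""
    if mode not in ("competitions", "datasets"):
        raise ValueError(f"mode must be 'competitions' or 'datasets', got {mode!r}")
    if not code:
        raise ValueError("code must be a non-empty competition or dataset code")
    # e.g. kaggle competitions files <code>  or  kaggle datasets files <owner>/<dataset>
    return ["kaggle", mode, "files", code]


print(" ".join(kaggle_files_command("titanic")))
print(" ".join(kaggle_files_command("some-owner/some-dataset", mode="datasets")))
```

The resulting list could be passed to `subprocess.run` once the API token is in place; running it requires the kaggle package and valid credentials.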


kaggle_download_competition_files

 kaggle_download_competition_files (competition_code:str=None,
                                    train_files:[]=[], test_files:list=[],
                                    submit_files:list=[],
                                    project_folder:str='ds')

Download all files for the passed competition, unzip them if required, and move them to the train, test and submit folders

competition_code: str, code of the kaggle competition
train_files: list of str, names of files to be moved into train folder
test_files: list of str, names of files to be moved into test folder
submit_files: list of str, names of files to be moved into submit folder
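The unzip-and-sort step can be illustrated with the standard library. This is a sketch under assumptions rather than the function's actual code: the helper name `sort_downloaded_files`, the `download_dir` and `project_dir` parameters, and the folder layout (`train`, `test` and `submit` directly under the project folder) are hypothetical, and the download itself is left out.

```python
import shutil
import zipfile
from pathlib import Path
from typing import Iterable


def sort_downloaded_files(
    download_dir: Path,
    project_dir: Path,
    train_files: Iterable[str] = (),
    test_files: Iterable[str] = (),
    submit_files: Iterable[str] = (),
) -> None:
    """Hypothetical sketch: unzip archives, then move named files into train/test/submit."""
    download_dir, project_dir = Path(download_dir), Path(project_dir)

    # Extract every zip archive in place, then delete the archive
    for archive in list(download_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(download_dir)
        archive.unlink()

    # Move each listed file into its target folder; names not found are skipped
    targets = {"train": train_files, "test": test_files, "submit": submit_files}
    for folder, names in targets.items():
        dest = project_dir / folder
        dest.mkdir(parents=True, exist_ok=True)
        for name in names:
            src = download_dir / name
            if src.is_file():
                shutil.move(str(src), str(dest / name))
```

Skipping missing names silently keeps the sketch short; a stricter variant could collect them and raise an error instead.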

Others


fastbook_on_colab

 fastbook_on_colab ()

Set up environment to run fastbook notebooks for colab

Code extracted from fastbook notebook:

# Install fastbook and dependencies
!pip install -Uqq fastbook

# Load utilities and install them
!wget -O utils.py https://raw.githubusercontent.com/vtecftwy/fastbook/walk-thru/utils.py
!wget -O fastbook_utils.py https://raw.githubusercontent.com/vtecftwy/fastbook/walk-thru/fastbook_utils.py

from fastbook_utils import *
from utils import *

# Setup My Drive
setup_book()

# Download images and code required for this notebook
import os
os.makedirs('images', exist_ok=True)
!wget -O images/chapter1_cat_example.jpg https://raw.githubusercontent.com/vtecftwy/fastai-course-v4/master/nbs/images/chapter1_cat_example.jpg
!wget -O images/cat-01.jpg https://raw.githubusercontent.com/vtecftwy/fastai-course-v4/walk-thru/nbs/images/cat-01.jpg
!wget -O images/cat-02.jpg https://raw.githubusercontent.com/vtecftwy/fastai-course-v4/walk-thru/nbs/images/cat-02.jpg
!wget -O images/dog-01.jpg https://raw.githubusercontent.com/vtecftwy/fastai-course-v4/walk-thru/nbs/images/dog-01.jpg
!wget -O images/dog-02.jpg https://raw.githubusercontent.com/vtecftwy/fastai-course-v4/walk-thru/nbs/images/dog-02.jpg