Reference for kaggle API: https://github.com/Kaggle/kaggle-api
Working with datasets
are_features_consistent
are_features_consistent (df1:pandas.core.frame.DataFrame,
df2:pandas.core.frame.DataFrame,
dependent_variables:list[str]=None,
raise_error:bool=False)
Verify that features/columns in training and test sets are consistent
| Parameter | Type | Default | Details |
|---|---|---|---|
| df1 | pd.DataFrame | | First set, typically the training set |
| df2 | pd.DataFrame | | Second set, typically the test set or inference set |
| dependent_variables | list[str] | None | List of column name(s) for dependent variables |
| raise_error | bool | False | True to raise an error if not consistent |
| Returns | bool | | True if features in train and test datasets are consistent, False otherwise |
Training set and test set should have the same features/columns, except for the dependent variable(s). This function tests whether this is the case.
import numpy as np
import pandas as pd

feats = [f"Feature_{i:02d}" for i in range(10)]
X_train = pd.DataFrame(np.random.normal(size=(500, 10)), columns=feats)
X_test = pd.DataFrame(np.random.normal(size=(100, 10)), columns=feats)
# Drop the first two columns to build a test set that is not consistent with X_train
X_test_not_consistant = X_test.iloc[:, 2:]
display(X_train.head(3))
display(X_test.head(3))
display(X_test_not_consistant.head(3))
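X_train.head(3):

| | Feature_00 | Feature_01 | Feature_02 | Feature_03 | Feature_04 | Feature_05 | Feature_06 | Feature_07 | Feature_08 | Feature_09 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.394439 | 0.266156 | -0.070705 | -0.462835 | 0.025394 | 0.361311 | 0.801035 | 0.205413 | 0.941988 | 2.868571 |
| 1 | 0.740853 | -1.390509 | -1.583919 | -1.951328 | -0.739606 | 0.775896 | -0.060068 | 0.121640 | 0.864439 | 1.192721 |
| 2 | 0.526661 | 0.233771 | 1.028485 | 0.284115 | -0.448474 | 0.512852 | -0.673979 | 0.426295 | -0.181841 | 0.455442 |

X_test.head(3):

| | Feature_00 | Feature_01 | Feature_02 | Feature_03 | Feature_04 | Feature_05 | Feature_06 | Feature_07 | Feature_08 | Feature_09 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.612301 | -0.659610 | -0.553156 | 0.477722 | 0.498676 | -2.585540 | 1.329870 | -1.638286 | -0.248535 | -1.322088 |
| 1 | 0.857624 | 1.224392 | 0.115925 | -0.055684 | -1.336148 | 3.651585 | 0.532247 | -1.325887 | -0.616351 | -1.350044 |
| 2 | 0.381214 | -0.024726 | 0.853689 | 0.270990 | -0.571249 | -0.117136 | -1.895106 | -0.176482 | -0.331920 | 0.671925 |

X_test_not_consistant.head(3):

| | Feature_02 | Feature_03 | Feature_04 | Feature_05 | Feature_06 | Feature_07 | Feature_08 | Feature_09 |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.553156 | 0.477722 | 0.498676 | -2.585540 | 1.329870 | -1.638286 | -0.248535 | -1.322088 |
| 1 | 0.115925 | -0.055684 | -1.336148 | 3.651585 | 0.532247 | -1.325887 | -0.616351 | -1.350044 |
| 2 | 0.853689 | 0.270990 | -0.571249 | -0.117136 | -1.895106 | -0.176482 | -0.331920 | 0.671925 |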
Compare all the features/columns
are_features_consistent(X_train, X_test)
are_features_consistent(X_train, X_test_not_consistant)
are_features_consistent(X_train, X_test_not_consistant, raise_error=True) should raise an error instead of returning False:
# test_fail is assumed to come from fastcore.test
from fastcore.test import test_fail

test_fail(
    f=are_features_consistent,
    args=(X_train, X_test_not_consistant),
    kwargs={'raise_error': True},
    contains="Discrepancy between training and test feature set:",
    msg="Should raise a ValueError"
)
When comparing a training set with an inference set, the training set has more columns because it includes the dependent variables. To test the consistency of the two datasets, specify which columns are dependent variables.
For instance, X_train has all features, including the two dependent variables Feature_08 and Feature_09.
X_inference = X_train.iloc[:, :-2]
display(X_train.head(3))
display(X_inference.head(3))
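X_train.head(3):

| | Feature_00 | Feature_01 | Feature_02 | Feature_03 | Feature_04 | Feature_05 | Feature_06 | Feature_07 | Feature_08 | Feature_09 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.394439 | 0.266156 | -0.070705 | -0.462835 | 0.025394 | 0.361311 | 0.801035 | 0.205413 | 0.941988 | 2.868571 |
| 1 | 0.740853 | -1.390509 | -1.583919 | -1.951328 | -0.739606 | 0.775896 | -0.060068 | 0.121640 | 0.864439 | 1.192721 |
| 2 | 0.526661 | 0.233771 | 1.028485 | 0.284115 | -0.448474 | 0.512852 | -0.673979 | 0.426295 | -0.181841 | 0.455442 |

X_inference.head(3):

| | Feature_00 | Feature_01 | Feature_02 | Feature_03 | Feature_04 | Feature_05 | Feature_06 | Feature_07 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.394439 | 0.266156 | -0.070705 | -0.462835 | 0.025394 | 0.361311 | 0.801035 | 0.205413 |
| 1 | 0.740853 | -1.390509 | -1.583919 | -1.951328 | -0.739606 | 0.775896 | -0.060068 | 0.121640 |
| 2 | 0.526661 | 0.233771 | 1.028485 | 0.284115 | -0.448474 | 0.512852 | -0.673979 | 0.426295 |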
are_features_consistent(X_train, X_inference, dependent_variables=['Feature_08', 'Feature_09'])
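The check described in this section can be pictured as a set comparison on column names. The following is only a minimal sketch under that assumption, not the library's implementation: the helper name columns_are_consistent is hypothetical, and it takes plain collections of column names (for a DataFrame, pass df.columns). The error message reuses the text expected by the test_fail call above.

```python
def columns_are_consistent(cols1, cols2, dependent_variables=None, raise_error=False):
    """Hypothetical helper (not the library function): compare two sets of column names."""
    # Columns listed as dependent variables are ignored on both sides
    excluded = set(dependent_variables or [])
    feats1 = set(cols1) - excluded
    feats2 = set(cols2) - excluded

    # Symmetric difference: columns present in exactly one of the two sets
    discrepancy = sorted(feats1 ^ feats2)
    if not discrepancy:
        return True
    if raise_error:
        raise ValueError(f"Discrepancy between training and test feature set: {discrepancy}")
    return False


train_cols = [f"Feature_{i:02d}" for i in range(10)]
inference_cols = train_cols[:-2]  # same columns without Feature_08 and Feature_09

print(columns_are_consistent(train_cols, train_cols))      # True
print(columns_are_consistent(train_cols, inference_cols))  # False
print(columns_are_consistent(train_cols, inference_cols,
                             dependent_variables=["Feature_08", "Feature_09"]))  # True
```

Using a symmetric difference flags columns missing from either side, so the order in which the two sets are passed does not matter.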
Kaggle
kaggle_setup_colab
kaggle_setup_colab (path_to_config_file:pathlib.Path|str=None)
Update kaggle API and create security key json file from config file on Google Drive
| Parameter | Type | Default | Details |
|---|---|---|---|
| path_to_config_file | Path \| str | None | Path to the configuration file (e.g. config.cfg) |
Technical Background
References: Kaggle API documentation
The Kaggle API token must be placed as a JSON file at one of the following locations:
~/.kaggle/kaggle.json (Linux, macOS)
%HOMEPATH%\.kaggle\kaggle.json (Windows)
To access Kaggle through the API from Colab, the security key must be placed in the correct location on the Colab machine.
The config.cfg file must include the following lines:
[kaggle]
kaggle_username = kaggle_user_name
kaggle_key = API key provided by kaggle
Instructions for obtaining an API key (kaggle.json) are given in the Kaggle API documentation.
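To make the link between config.cfg and kaggle.json concrete, here is a minimal sketch of how the [kaggle] section could be turned into a token file. It is an illustration under stated assumptions, not the code of kaggle_setup_colab: the helper name write_kaggle_token and its kaggle_dir parameter are hypothetical, and the demo writes into a temporary directory rather than the real home folder. The "username" and "key" fields are the ones used by kaggle.json.

```python
import configparser
import json
import tempfile
from pathlib import Path


def write_kaggle_token(path_to_config_file, kaggle_dir=None):
    """Hypothetical helper: build kaggle.json from the [kaggle] section of a config file."""
    config = configparser.ConfigParser()
    # ConfigParser.read returns the list of files it managed to parse
    if not config.read(path_to_config_file):
        raise FileNotFoundError(f"Cannot read config file: {path_to_config_file}")

    token = {
        "username": config["kaggle"]["kaggle_username"],
        "key": config["kaggle"]["kaggle_key"],
    }

    # Default to ~/.kaggle when no directory is given
    kaggle_dir = Path(kaggle_dir) if kaggle_dir is not None else Path.home() / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    token_path = kaggle_dir / "kaggle.json"
    token_path.write_text(json.dumps(token))
    # Restrict the token to its owner (has limited effect on Windows)
    token_path.chmod(0o600)
    return token_path


# Demo in a temporary directory with placeholder credentials
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config.cfg"
    cfg.write_text("[kaggle]\nkaggle_username = kaggle_user_name\nkaggle_key = 0123456789abcdef\n")
    demo_path = write_kaggle_token(cfg, kaggle_dir=Path(tmp) / ".kaggle")
    print(json.loads(demo_path.read_text()))
```

On Colab, the config file would typically be read from a mounted Google Drive path, and kaggle_dir left at its default.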
kaggle_list_files
kaggle_list_files (code:str=None, mode:str='competitions')
List all files available in the competition or dataset for the passed code
| Parameter | Type | Default | Details |
|---|---|---|---|
| code | str | None | Code for the kaggle competition or dataset |
| mode | str | competitions | Mode: competitions or datasets |
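For orientation, the two modes correspond to the Kaggle command line calls `kaggle competitions files <code>` and `kaggle datasets files <code>`. The sketch below builds that command and optionally runs it. It is an assumption about an equivalent approach, not a description of how kaggle_list_files is implemented. The helper names are hypothetical, and "titanic" and "zillow/zecon" are only example identifiers.

```python
import subprocess

VALID_MODES = ("competitions", "datasets")


def build_list_files_command(code, mode="competitions"):
    """Hypothetical helper: argument list for listing files with the Kaggle CLI."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    if not code:
        raise ValueError("code must be a non-empty competition or dataset identifier")
    # Equivalent to: kaggle competitions files <code>  or  kaggle datasets files <code>
    return ["kaggle", mode, "files", code]


def list_files(code, mode="competitions"):
    """Run the command; needs the kaggle package installed and a valid API token."""
    result = subprocess.run(build_list_files_command(code, mode),
                            capture_output=True, text=True, check=True)
    return result.stdout


print(build_list_files_command("titanic"))
print(build_list_files_command("zillow/zecon", mode="datasets"))
```

Validating mode before calling the CLI gives a clear error message for typos such as "competition" instead of a less readable failure from the subprocess.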
kaggle_download_competition_files
kaggle_download_competition_files (competition_code:str=None,
train_files:[]=[], test_files:list=[],
submit_files:list=[],
project_folder:str='ds')
Download all files for the passed competition, unzip them if required, and move them to the train, test and submit folders.
competition_code: str, code of the kaggle competition
train_files: list of str, names of files to be moved into the train folder
test_files: list of str, names of files to be moved into the test folder
submit_files: list of str, names of files to be moved into the submit folder
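The unzip-and-sort step can be sketched with the standard library alone. This is an illustration, not the function's actual code: the helper name sort_competition_files is hypothetical, the folder layout (train, test and submit directly under a destination root) is an assumption, and the download itself is left out because it needs Kaggle credentials.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def sort_competition_files(download_dir, dest_root, train_files=(), test_files=(), submit_files=()):
    """Hypothetical helper: unzip archives, then move named files into train/test/submit."""
    download_dir, dest_root = Path(download_dir), Path(dest_root)

    # Unzip every archive found in the download folder, in place
    for archive in list(download_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(download_dir)

    moved = {}
    for folder, names in (("train", train_files), ("test", test_files), ("submit", submit_files)):
        target = dest_root / folder
        target.mkdir(parents=True, exist_ok=True)
        for name in names:
            source = download_dir / name
            if source.exists():  # skip names that were not part of the download
                shutil.move(str(source), str(target / name))
                moved[name] = folder
    return moved


# Demo with a small archive created in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    downloads = Path(tmp) / "downloads"
    downloads.mkdir()
    with zipfile.ZipFile(downloads / "data.zip", "w") as zf:
        zf.writestr("train.csv", "id,y\n1,0\n")
        zf.writestr("test.csv", "id\n2\n")
    print(sort_competition_files(downloads, Path(tmp) / "ds",
                                 train_files=["train.csv"],
                                 test_files=["test.csv"],
                                 submit_files=["sample_submission.csv"]))
    # {'train.csv': 'train', 'test.csv': 'test'}
```

Names that are not found after unzipping are skipped silently here. Raising an error instead would be a reasonable alternative when a missing file should stop the setup.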
Others
fastbook_on_colab
fastbook_on_colab ()
Set up environment to run fastbook notebooks for colab
Code extracted from fastbook notebook:
# Install fastbook and dependencies
!pip install -Uqq fastbook
# Load utilities and install them
!wget -O utils.py https://raw.githubusercontent.com/vtecftwy/fastbook/walk-thru/utils.py
!wget -O fastbook_utils.py https://raw.githubusercontent.com/vtecftwy/fastbook/walk-thru/fastbook_utils.py
from fastbook_utils import *
from utils import *
# Setup My Drive
setup_book()
# Download images and code required for this notebook
import os
os.makedirs('images', exist_ok=True)
!wget -O images/chapter1_cat_example.jpg https://raw.githubusercontent.com/vtecftwy/fastai-course-v4/master/nbs/images/chapter1_cat_example.jpg
!wget -O images/cat-01.jpg https://raw.githubusercontent.com/vtecftwy/fastai-course-v4/walk-thru/nbs/images/cat-01.jpg
!wget -O images/cat-02.jpg https://raw.githubusercontent.com/vtecftwy/fastai-course-v4/walk-thru/nbs/images/cat-02.jpg
!wget -O images/dog-01.jpg https://raw.githubusercontent.com/vtecftwy/fastai-course-v4/walk-thru/nbs/images/dog-01.jpg
!wget -O images/dog-02.jpg https://raw.githubusercontent.com/vtecftwy/fastai-course-v4/walk-thru/nbs/images/dog-01.jpg