`eda_stats_utils`

Utility Functions that can be used for exploratory data analysis and statistics. Includes all stable utility functions for eda and statistics.

Data analysis plots

source

ecdf

 ecdf (data:pandas.core.series.Series|numpy.ndarray,
       threshold:Optional[int]=None,
       figsize:Optional[tuple[int,int]]=None)

Compute Empirical Cumulative Distribution Function (ECDF), plot it and returns values.

	Type	Default	Details
data	pd.Series \| np.ndarray		data to analyse
threshold	Optional[int]	None	cummulative frequency used as threshold. Must be between 0 and 1
figsize	Optional[tuple[int, int]]	None	figure size (width, height)
Returns	tuple[np.array, np.array, int]		sorted data (ascending), cumulative frequencies, last index

ecdf plots the empirical cumulative distribution function (ECDF), for data cumulative frequencies from 0 to threshold <= 1.

The empirical cumulative distribution function (ECDF) is a step function that jumps up by 1/n at each of the n data points in the dataset. Its value at any specified value of the measured variable is the fraction of observations of the measured variable that are less than or equal to the specified value.

The ECDF is an estimate of the cumulative distribution function that generated the points in the sample. It allows to compare with the distribution that is expected.

df = pd.DataFrame(data={'a': np.random.random(100) * 100,'b': np.random.random(100) * 50,'c': np.random.random(100)})
data_1, freq_1, last_idx_1 = ecdf(data=df.a, threshold=1, figsize=(5, 5))

data_2, freq_2, last_idx_2 = ecdf(data=df.a, threshold=0.75, figsize=(5, 5))

data_3, freq_3, last_idx_3 = ecdf(data=df.a, threshold=0.5, figsize=(5, 5))

The ecdf function also returns: - the data used for the ECDF, with values sorted from smallest to largest - the respective cummulative frequencies - the index of data value/frequency plotted (at the threshold)

data_1

array([ 0.85,  2.1 ,  2.44,  3.22,  3.24,  3.9 ,  4.65,  5.53,  6.08,  6.34,  6.84,  6.98,  8.37,  9.09, 12.58, 12.96,
       13.62, 14.5 , 15.06, 17.11, 17.52, 17.99, 19.12, 22.3 , 22.45, 23.87, 23.87, 24.15, 24.28, 25.95, 26.79, 30.31,
       30.4 , 34.83, 38.01, 38.56, 38.62, 39.47, 39.58, 41.1 , 42.43, 43.57, 43.73, 45.92, 47.19, 47.66, 48.52, 49.03,
       49.6 , 49.98, 51.07, 52.14, 53.06, 53.92, 54.66, 55.88, 56.12, 56.29, 56.52, 57.02, 57.63, 58.54, 59.37, 62.65,
       62.91, 62.93, 63.22, 64.12, 64.82, 65.06, 65.35, 67.16, 70.6 , 70.94, 72.24, 72.45, 73.87, 74.91, 78.44, 79.17,
       79.78, 79.82, 82.53, 84.96, 85.14, 85.43, 86.19, 86.35, 88.31, 89.76, 90.42, 91.69, 93.11, 96.71, 97.38, 98.32,
       98.33, 98.37, 98.56, 98.83])

freq_1

array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 , 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18,
       0.19, 0.2 , 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31, 0.32, 0.33, 0.34, 0.35, 0.36,
       0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 , 0.51, 0.52, 0.53, 0.54,
       0.55, 0.56, 0.57, 0.58, 0.59, 0.6 , 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7 , 0.71, 0.72,
       0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 , 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 ,
       0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1.  ])

The function returns all the sorted data and frequencies, independently from the threshold. In the example above, data_1 and data_2 have the same values.

np.array_equal(data_1, data_2), np.array_equal(freq_1, freq_2)

(True, True)

data_2[:last_idx_2+1]

array([ 0.85,  2.1 ,  2.44,  3.22,  3.24,  3.9 ,  4.65,  5.53,  6.08,  6.34,  6.84,  6.98,  8.37,  9.09, 12.58, 12.96,
       13.62, 14.5 , 15.06, 17.11, 17.52, 17.99, 19.12, 22.3 , 22.45, 23.87, 23.87, 24.15, 24.28, 25.95, 26.79, 30.31,
       30.4 , 34.83, 38.01, 38.56, 38.62, 39.47, 39.58, 41.1 , 42.43, 43.57, 43.73, 45.92, 47.19, 47.66, 48.52, 49.03,
       49.6 , 49.98, 51.07, 52.14, 53.06, 53.92, 54.66, 55.88, 56.12, 56.29, 56.52, 57.02, 57.63, 58.54, 59.37, 62.65,
       62.91, 62.93, 63.22, 64.12, 64.82, 65.06, 65.35, 67.16, 70.6 , 70.94, 72.24])

source

cluster_columns

Plot dendogram based on Dataframe’s columns’ spearman correlation coefficients

This function was first seen on fastai repo

feats = [f"Feature_{i:02d}" for i in range(10)]
print('Features:')
print(', '.join(feats))
X = pd.DataFrame(np.random.normal(size=(500, 10)), columns=feats)
cluster_columns(X, (6, 4), 8)

Features:
Feature_00, Feature_01, Feature_02, Feature_03, Feature_04, Feature_05, Feature_06, Feature_07, Feature_08, Feature_09