pipeline

Predefined sklearn pipeline to process financial data and create features from it

Custom transforms

Transforms for preprocessing financial data and engineering features from it.


MyBaseTransformer

 MyBaseTransformer ()

Base class for my custom transformers

import pandas as pd
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('base-transformer', MyBaseTransformer())
])

df = load_test_df()
pipe.fit_transform(df)[:5,:]
array([[ 2759.02,  2779.27,  2747.27,  2754.48, 26562.  ],
       [ 2753.11,  2755.36,  2690.69,  2743.45, 38777.  ],
       [ 2744.83,  2748.58,  2651.23,  2672.8 , 41777.  ],
       [ 2670.8 ,  2722.9 ,  2657.93,  2680.71, 39034.  ],
       [ 2675.59,  2692.34,  2627.59,  2663.57, 61436.  ]])
pd.DataFrame(data=pipe.fit_transform(df), columns=pipe.get_feature_names_out(), index=df.index).head(5)
Open_transformed High_transformed Low_transformed Close_transformed Volume_transformed
dt
2018-10-22 2759.02 2779.27 2747.27 2754.48 26562.0
2018-10-23 2753.11 2755.36 2690.69 2743.45 38777.0
2018-10-24 2744.83 2748.58 2651.23 2672.80 41777.0
2018-10-25 2670.80 2722.90 2657.93 2680.71 39034.0
2018-10-26 2675.59 2692.34 2627.59 2663.57 61436.0

ReturnTransformer

 ReturnTransformer (periods:int=1)

Evaluate the percentage return over 1 or more periods

from sklearn.compose import ColumnTransformer
pipe = ColumnTransformer([
    ('r', ReturnTransformer(), ['Open', 'Close', 'Volume'])
])
pd.DataFrame(pipe.fit_transform(df), columns=pipe.get_feature_names_out(), index=df.index).head(5)
r__Open_ret1 r__Close_ret1 r__Volume_ret1
dt
2018-10-22 0.000000 0.000000 0.000000
2018-10-23 -0.002142 -0.004004 0.459867
2018-10-24 -0.003008 -0.025752 0.077365
2018-10-25 -0.026971 0.002959 -0.065658
2018-10-26 0.001793 -0.006394 0.573910
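The values above are consistent with a plain percentage change in which the undefined leading rows are set to 0. The following pandas-only sketch reproduces them under that assumption; the helper name `pct_return` is hypothetical and is not part of the library.

```python
import pandas as pd


def pct_return(frame: pd.DataFrame, periods: int = 1) -> pd.DataFrame:
    """Percentage change over `periods` rows, with undefined leading rows set to 0."""
    out = frame / frame.shift(periods) - 1.0
    # Assumption: missing values become 0, as the first row of the output above suggests.
    return out.fillna(0.0).add_suffix(f"_ret{periods}")


# Open and Close values copied from the test data shown on this page.
prices = pd.DataFrame(
    {
        "Open": [2759.02, 2753.11, 2744.83, 2670.80, 2675.59],
        "Close": [2754.48, 2743.45, 2672.80, 2680.71, 2663.57],
    },
    index=pd.to_datetime(
        ["2018-10-22", "2018-10-23", "2018-10-24", "2018-10-25", "2018-10-26"]
    ),
)
ret1 = pct_return(prices, periods=1)
ret3 = pct_return(prices, periods=3)
print(ret1.round(6))
```

Note that filling with 0 makes the first `periods` rows indistinguishable from a genuinely flat market, so downstream statistics over those rows should be read with care.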

StdTransformer

 StdTransformer (window:int=5)

Evaluate the standard deviation over a window

pipe = ColumnTransformer([
    ('std', StdTransformer(3), ['Open', 'Close'])
])
pd.DataFrame(data=pipe.fit_transform(df), columns=pipe.get_feature_names_out(), index=df.index).head(5)
std__Open_std3 std__Close_std3
dt
2018-10-22 NaN NaN
2018-10-23 4.179001 7.799388
2018-10-24 7.127910 44.318367
2018-10-25 45.320958 38.708953
2018-10-26 41.427774 8.578467
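The output above matches a rolling sample standard deviation that starts producing values as soon as two observations are available. A pandas-only sketch under that assumption follows; `rolling_std` is a hypothetical helper and `min_periods=1` is inferred from the output, not taken from the library source.

```python
import math

import pandas as pd


def rolling_std(frame: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    """Rolling sample standard deviation (ddof=1) over `window` rows."""
    # min_periods=1 lets partial windows through; a single observation still gives NaN.
    out = frame.rolling(window=window, min_periods=1).std()
    return out.add_suffix(f"_std{window}")


# Open values copied from the test data shown on this page.
opens = pd.DataFrame(
    {"Open": [2759.02, 2753.11, 2744.83, 2670.80, 2675.59]},
    index=pd.to_datetime(
        ["2018-10-22", "2018-10-23", "2018-10-24", "2018-10-25", "2018-10-26"]
    ),
)
std3 = rolling_std(opens, window=3)
first_is_nan = math.isnan(std3["Open_std3"].iloc[0])
print(std3.round(6))
```

With this convention the second row is the standard deviation of only two values, so early rows are noisier than rows computed over a full window.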

MATransformer

 MATransformer (window:int=5)

Evaluate the moving average over a window
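As an illustration, the Close_MA3 values shown in the combined example on this page can be reproduced with a pandas rolling mean that accepts partial windows. This is a sketch under that assumption; `moving_average` is a hypothetical helper, not the library's implementation.

```python
import pandas as pd


def moving_average(series: pd.Series, window: int = 5) -> pd.Series:
    """Simple moving average; early rows average over the rows available so far."""
    out = series.rolling(window=window, min_periods=1).mean()
    return out.rename(f"{series.name}_MA{window}")


# Close values copied from the test data shown on this page.
close = pd.Series([2754.48, 2743.45, 2672.80, 2680.71, 2663.57], name="Close")
ma3 = moving_average(close, window=3)
print(ma3.round(6))
```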


EMATransformer

 EMATransformer (window:int=5)

Evaluate the exponential moving average over a window
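The Open_EMA3 values shown in the combined example on this page agree with pandas' exponentially weighted mean when `window` is passed as `span` and the default `adjust=True` is kept. The sketch below assumes exactly that; `exp_moving_average` is a hypothetical helper name.

```python
import pandas as pd


def exp_moving_average(series: pd.Series, window: int = 5) -> pd.Series:
    """Exponentially weighted mean with span=window, i.e. alpha = 2 / (window + 1)."""
    out = series.ewm(span=window).mean()  # adjust=True is the pandas default
    return out.rename(f"{series.name}_EMA{window}")


# Open values copied from the test data shown on this page.
open_ = pd.Series([2759.02, 2753.11, 2744.83, 2670.80, 2675.59], name="Open")
ema3 = exp_moving_average(open_, window=3)
print(ema3.round(6))
```

With `adjust=True` the weights are renormalised over the observations seen so far, which is why the first row equals the first price instead of being biased toward an arbitrary start value.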

Build a pipeline applying these transforms to specific columns.

pipe = ColumnTransformer([
    ('thru', 'passthrough', ['Open', 'High', 'Low', 'Close', 'Volume']),
    ('ret', ReturnTransformer(3), ['Close']),
    ('ma', MATransformer(3), ['Close']),
    ('ema', EMATransformer(3), ['Open', 'Close'])
])
pd.DataFrame(data=pipe.fit_transform(df), columns=pipe.get_feature_names_out(), index=df.index).head(5)
thru__Open thru__High thru__Low thru__Close thru__Volume ret__Close_ret3 ma__Close_MA3 ema__Open_EMA3 ema__Close_EMA3
dt
2018-10-22 2759.02 2779.27 2747.27 2754.48 26562.0 0.000000 2754.480000 2759.020000 2754.480000
2018-10-23 2753.11 2755.36 2690.69 2743.45 38777.0 0.000000 2748.965000 2755.080000 2747.126667
2018-10-24 2744.83 2748.58 2651.23 2672.80 41777.0 0.000000 2723.576667 2749.222857 2704.654286
2018-10-25 2670.80 2722.90 2657.93 2680.71 39034.0 -0.026782 2698.986667 2707.397333 2691.884000
2018-10-26 2675.59 2692.34 2627.59 2663.57 61436.0 -0.029117 2672.360000 2690.980645 2677.270323

simplify_colnames

 simplify_colnames (cols)

Simplify the column names by removing the prefix

print(pipe.get_feature_names_out())
['thru__Open' 'thru__High' 'thru__Low' 'thru__Close' 'thru__Volume'
 'ret__Close_ret3' 'ma__Close_MA3' 'ema__Open_EMA3' 'ema__Close_EMA3']
print(simplify_colnames(pipe.get_feature_names_out()))
['Open', 'High', 'Low', 'Close', 'Volume', 'Close_ret3', 'Close_MA3', 'Open_EMA3', 'Close_EMA3']
df_proc = pd.DataFrame(
    data=pipe.fit_transform(df), 
    columns=simplify_colnames(pipe.get_feature_names_out()), 
    index=df.index)
df_proc.head(5)
Open High Low Close Volume Close_ret3 Close_MA3 Open_EMA3 Close_EMA3
dt
2018-10-22 2759.02 2779.27 2747.27 2754.48 26562.0 0.000000 2754.480000 2759.020000 2754.480000
2018-10-23 2753.11 2755.36 2690.69 2743.45 38777.0 0.000000 2748.965000 2755.080000 2747.126667
2018-10-24 2744.83 2748.58 2651.23 2672.80 41777.0 0.000000 2723.576667 2749.222857 2704.654286
2018-10-25 2670.80 2722.90 2657.93 2680.71 39034.0 -0.026782 2698.986667 2707.397333 2691.884000
2018-10-26 2675.59 2692.34 2627.59 2663.57 61436.0 -0.029117 2672.360000 2690.980645 2677.270323
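The prefix in question is the `<name>__` part that ColumnTransformer prepends to each output column. A minimal sketch of such a helper is shown below; the name `strip_prefix` and the choice to split on the first `__` are assumptions, and the library function may handle edge cases differently.

```python
def strip_prefix(cols, sep="__"):
    """Drop everything up to and including the first `sep` in each column name."""
    # Names without the separator are returned unchanged.
    return [str(col).split(sep, 1)[-1] for col in cols]


# Column names copied from the printed output above.
prefixed = [
    "thru__Open", "thru__High", "thru__Low", "thru__Close", "thru__Volume",
    "ret__Close_ret3", "ma__Close_MA3", "ema__Open_EMA3", "ema__Close_EMA3",
]
simplified = strip_prefix(prefixed)
print(simplified)
```

Stripping prefixes can create duplicate names if two transformers emit the same suffix for the same column, so it is worth checking that the simplified names are unique.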

dframe

 dframe (df:pandas.core.frame.DataFrame, pipe:sklearn.base.BaseEstimator,
         fit:bool=True)

Takes a pipeline, (optionally) fits it on df, transforms df, and returns a well-formatted DataFrame

         Type           Default  Details
df       DataFrame               dataframe to transform
pipe     BaseEstimator           pipeline to apply on the dataframe
fit      bool           True     fit_transform if True, transform if False
Returns  DataFrame               formatted DataFrame
dframe(df, pipe).head(5)
Open High Low Close Volume Close_ret3 Close_MA3 Open_EMA3 Close_EMA3
dt
2018-10-22 2759.02 2779.27 2747.27 2754.48 26562.0 0.000000 2754.480000 2759.020000 2754.480000
2018-10-23 2753.11 2755.36 2690.69 2743.45 38777.0 0.000000 2748.965000 2755.080000 2747.126667
2018-10-24 2744.83 2748.58 2651.23 2672.80 41777.0 0.000000 2723.576667 2749.222857 2704.654286
2018-10-25 2670.80 2722.90 2657.93 2680.71 39034.0 -0.026782 2698.986667 2707.397333 2691.884000
2018-10-26 2675.59 2692.34 2627.59 2663.57 61436.0 -0.029117 2672.360000 2690.980645 2677.270323

Complex Pipeline Factory

input: OHLCV

parameters:

  • feature creation:
    • coi: list of columns to apply the feature engineering on
    • transforms: list of transforms to apply on each selected column
  • statistics applied to the new features:
    • stats: list of statistics to apply to each new feature
coi = ['Open', 'Close']
transforms = [ReturnTransformer(1),ReturnTransformer(10)]
stats = [EMATransformer(5), EMATransformer(20), EMATransformer(60), StdTransformer(5)]

build_pipeline

 build_pipeline (coi:list[str],
                 transforms:list[sklearn.base.BaseEstimator],
                 stats:list[sklearn.base.BaseEstimator])

Build a pipeline using the columns in coi, the transformers in transforms and the statistics in stats.

The output is a pipeline that takes an OHLCV DataFrame and returns:

  • the original OHLCV columns
  • features engineered from the columns in coi by applying each of the transformers in transforms
  • the statistics in stats, computed on each of the engineered features
            Type           Details
coi         list           list of columns to use for feature engineering
transforms  list           list of feature engineering transformers to apply to each column in coi
stats       list           list of stats transformers to apply to engineered feature
Returns     BaseEstimator
pipe = build_pipeline(coi, transforms, stats)

pipe
ColumnTransformer(transformers=[('thru', 'passthrough',
                                 ['Open', 'High', 'Low', 'Close', 'Volume']),
                                ('return1',
                                 Pipeline(steps=[('return1',
                                                  ReturnTransformer()),
                                                 ('return1stats',
                                                  FeatureUnion(transformer_list=[('thru',
                                                                                  'passthrough'),
                                                                                 ('ema5',
                                                                                  EMATransformer()),
                                                                                 ('ema20',
                                                                                  EMATransformer(window=20)),
                                                                                 ('ema60',
                                                                                  EMATransformer(window=60)),
                                                                                 ('std5',
                                                                                  StdTransformer())]))]),
                                 ['Open', 'Close']),
                                ('return10',
                                 Pipeline(steps=[('return10',
                                                  ReturnTransformer(periods=10)),
                                                 ('return10stats',
                                                  FeatureUnion(transformer_list=[('thru',
                                                                                  'passthrough'),
                                                                                 ('ema5',
                                                                                  EMATransformer()),
                                                                                 ('ema20',
                                                                                  EMATransformer(window=20)),
                                                                                 ('ema60',
                                                                                  EMATransformer(window=60)),
                                                                                 ('std5',
                                                                                  StdTransformer())]))]),
                                 ['Open', 'Close'])])
with pandas_nrows_ncols():
    display(dframe(df, pipe).head(5))
Open High Low Close Volume Open_ret1 Close_ret1 Open_ret1_EMA5 Close_ret1_EMA5 Open_ret1_EMA20 Close_ret1_EMA20 Open_ret1_EMA60 Close_ret1_EMA60 Open_ret1_std5 Close_ret1_std5 Open_ret10 Close_ret10 Open_ret10_EMA5 Close_ret10_EMA5 Open_ret10_EMA20 Close_ret10_EMA20 Open_ret10_EMA60 Close_ret10_EMA60 Open_ret10_std5 Close_ret10_std5
dt
2018-10-22 2759.02 2779.27 2747.27 2754.48 26562.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN
2018-10-23 2753.11 2755.36 2690.69 2743.45 38777.0 -0.002142 -0.004004 -0.001285 -0.002403 -0.001125 -0.002102 -0.001089 -0.002036 0.001515 0.002832 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2018-10-24 2744.83 2748.58 2651.23 2672.80 41777.0 -0.003008 -0.025752 -0.002101 -0.013463 -0.001816 -0.010786 -0.001750 -0.010206 0.001548 0.013858 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2018-10-25 2670.80 2722.90 2657.93 2680.71 39034.0 -0.026971 0.002959 -0.012432 -0.006641 -0.009078 -0.006818 -0.008374 -0.006748 0.012690 0.013019 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2018-10-26 2675.59 2692.34 2627.59 2663.57 61436.0 0.001793 -0.006394 -0.006971 -0.006546 -0.006448 -0.006716 -0.006203 -0.006673 0.011836 0.011275 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0