Quick Start

Welcome to the quick start guide for tubular!

Installation

The easiest way to get tubular is to install it directly from PyPI:

 pip install tubular


Thanks for installing tubular! We hope you find it useful!

Examples

There are example notebooks available on GitHub that demonstrate the functionality of each transformer.

To open them in Binder, click here. Once Binder has loaded, click the directory button in the left sidebar and navigate to the notebook of interest.

Transformers summary

Each of the modules in tubular contains transformers that deal with a specific type of data or problem.

We are always looking for new functionality to improve the package. If you would like to add a new transformer, create a pull request to let us know your idea, then have a look at the contributing guide.

Base

This module contains the DataFrameMethodTransformer which allows any pandas.DataFrame method to be used in a transformer.

For example, to take the product of some columns, a user can pass the pandas.DataFrame.prod method to this transformer.

This transformer saves us from reimplementing many transformations that are already available in pandas.
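The idea can be sketched in plain pandas, without tubular itself: look up a DataFrame method by name and apply it to a subset of columns. The helper `apply_df_method` and the column names are hypothetical, and this illustrates the concept rather than tubular's own implementation or API.

```python
import pandas as pd


def apply_df_method(df, new_column, method_name, columns, **method_kwargs):
    """Apply a pandas.DataFrame method, chosen by name, to some columns."""
    out = df.copy()
    method = getattr(out[columns], method_name)  # e.g. DataFrame.prod
    out[new_column] = method(**method_kwargs)
    return out


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Row-wise product of columns "a" and "b", stored in a new column.
result = apply_df_method(df, "a_times_b", "prod", ["a", "b"], axis=1)
```

Because the method is selected by name, the same helper would work for sum, mean or any other DataFrame method that returns one value per row.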

Capping

This module deals with capping of numeric columns.

The standard CappingTransformer can apply capping at min and max values for different columns.

The OutOfRangeNullTransformer works in a similar way, but replaces values outside the cap range with null values rather than with the min or max (whichever side they fall on).
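The difference between the two behaviours can be illustrated with plain pandas. This is a sketch of the concept only; the cap range and values are made up, and it does not use tubular's transformers.

```python
import pandas as pd

values = pd.Series([-5.0, 2.0, 7.0, 40.0])
lower, upper = 0.0, 10.0  # hypothetical cap range

# Capping: values outside the range are pulled back to the nearest bound.
capped = values.clip(lower=lower, upper=upper)

# Out-of-range nulling: values outside the range become null instead.
nulled = values.where(values.between(lower, upper))
```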

Dates

This module contains transformers to deal with datetime columns.

Date differencing is available, either accounting for leap years (DateDiffLeapYearTransformer) or not (DateDifferenceTransformer).

The BetweenDatesTransformer calculates whether one date falls between two others.

The ToDatetimeTransformer converts columns to datetime type.

The SeriesDtMethodTransformer allows the user to use pandas.Series.dt methods in a similar way to base.DataFrameMethodTransformer.

The DatetimeInfoExtractor allows the user to extract datetime info such as the time of day or month from a datetime field.

The DatetimeSinusoidCalculator derives a feature in a dataframe by calculating the sine or cosine of a datetime column.
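Sine and cosine features are typically used so that values at either end of a cycle (such as 23:00 and 01:00) end up close together. The sketch below shows one common way to compute them, assuming the value is scaled by its period; `cyclical_encode` is a hypothetical helper and not part of tubular.

```python
import math
from datetime import datetime


def cyclical_encode(value, period):
    """Return (sin, cos) of a value placed on a cycle of the given period."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# Hour of day has period 24, so 23:00 and 01:00 land near each other.
late = cyclical_encode(datetime(2024, 1, 1, 23).hour, 24)
early = cyclical_encode(datetime(2024, 1, 2, 1).hour, 24)

# Euclidean distance between the two encoded points on the unit circle.
distance = math.dist(late, early)
```

On the raw hour scale these two times are 22 units apart, whereas the encoded points are much closer than the circle's diameter of 2.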

Imputers

This module contains standard imputation techniques (mean, median and mode) as well as the NearestMeanResponseImputer, which imputes with the value that is closest to the null values in terms of average response. All of these support weights.
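Nearest mean response imputation is less familiar than mean or median imputation, so here is a minimal, unweighted sketch in plain pandas. The data and column names are invented, and the logic is an interpretation of the description above rather than tubular's actual code.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "x": [1.0, 1.0, 2.0, 2.0, None, None],
        "y": [0.0, 0.2, 0.8, 1.0, 0.7, 0.9],  # response column
    }
)

# Average response for the rows where x is null.
null_mean = df.loc[df["x"].isna(), "y"].mean()

# Average response for each non-null value of x.
level_means = df.dropna(subset=["x"]).groupby("x")["y"].mean()

# Impute with the value of x whose average response is closest.
impute_value = (level_means - null_mean).abs().idxmin()
imputed = df["x"].fillna(impute_value)
```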

The NullIndicator is used to create binary indicators of where null values are present in a column.

Mapping

This module contains transformers that deal with explicit mappings of values.

The MappingTransformer deals with standard mapping of one set of values to another.

The CrossColumnMappingTransformer, CrossColumnAddTransformer and CrossColumnMultiplyTransformer apply mapping, addition or multiplication to values in one column based on values in another.
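As a rough illustration in plain pandas, a standard mapping replaces values within a column, while a cross-column mapping adjusts one column depending on the value in another. The column names and dictionaries below are hypothetical, and this sketch does not reproduce tubular's API.

```python
import pandas as pd

df = pd.DataFrame(
    {"region": ["north", "south", "north"], "tier": ["a", "a", "b"]}
)

# Standard mapping: replace values in a column via a dictionary.
region_codes = {"north": "N", "south": "S"}
df["region"] = df["region"].map(region_codes)

# Cross-column mapping: overwrite "tier" only where "region" has a given value.
tier_by_region = {"S": "c"}
mask = df["region"].isin(list(tier_by_region))
df.loc[mask, "tier"] = df.loc[mask, "region"].map(tier_by_region)
```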

Misc

The misc module contains transformers which do not fit into other categories.

SetValueTransformer creates a constant column with an arbitrary value.

SetDtype allows the user to set the dtype of a column.

Nominal

This module contains categorical encoding techniques.

There are response encoding techniques such as MeanResponseTransformer, one-hot encoding (OneHotEncodingTransformer) and grouping of infrequently occurring levels (GroupRareLevelsTransformer).

MeanResponseTransformer also supports regularisation of encodings using a prior.
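Regularising with a prior usually means shrinking each level's mean response towards the overall mean, with rarer levels shrunk more. The sketch below uses one common formula for this, (sum of responses + prior × global mean) / (count + prior). That formula is an assumption for illustration and may differ from what tubular implements; `regularised_mean_encoding` is a hypothetical helper.

```python
from collections import defaultdict


def regularised_mean_encoding(levels, response, prior):
    """Shrink each level's mean response towards the global mean."""
    global_mean = sum(response) / len(response)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, response):
        totals[level] += y
        counts[level] += 1
    return {
        level: (totals[level] + prior * global_mean) / (counts[level] + prior)
        for level in counts
    }


levels = ["a", "a", "a", "a", "b"]
response = [1, 0, 1, 0, 1]

plain = regularised_mean_encoding(levels, response, prior=0)   # no shrinkage
shrunk = regularised_mean_encoding(levels, response, prior=2)  # pulled to mean
```

With prior=0 the encoding is just the per-level mean. With a positive prior, level "b", which has a single observation, moves noticeably towards the global mean, while the better-supported level "a" moves less.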

Numeric

This module contains numeric transformations: cut (CutTransformer), log (LogTransformer) and scaling (ScalingTransformer).

TwoColumnOperatorTransformer allows a user to apply operations to two columns using pandas.DataFrame methods that require multiple columns (e.g. add, subtract, multiply etc.).

It also contains InteractionTransformer and PCATransformer, which create interaction terms and PCA components.

Strings

The strings module contains useful transformers for working with strings. SeriesStrMethodTransformer allows the user to access pandas.Series.str methods within tubular. StringConcatenator allows a user to concatenate multiple columns of varied dtype into a single string output.
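Concatenating columns of varied dtype can be sketched in plain pandas as follows. The column names and separator are made up, and this shows the general idea rather than how StringConcatenator is implemented.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "code": ["x", "y"], "score": [0.5, 1.5]})

# Cast every column to str first so mixed dtypes can be joined safely,
# then join the values of each row with a separator.
df["combined"] = (
    df[["id", "code", "score"]].astype(str).apply("-".join, axis=1)
)
```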

Reporting an issue

If you find an issue or bug in the package, please create an issue on GitHub.

We really appreciate the time anyone takes to file an issue, as this helps us improve the package.