tubular.capping.CappingTransformer

class tubular.capping.CappingTransformer(capping_values=None, quantiles=None, weights_column=None, **kwargs)[source]

Bases: tubular.base.BaseTransformer

Transformer to cap numeric values at a minimum value, a maximum value, or both.

For max capping any values above the cap value will be set to the cap. Similarly for min capping any values below the cap will be set to the cap. Only works for numeric columns.

Parameters
  • capping_values (dict or None, default = None) – Dictionary of capping values to apply to each column. The keys in the dict should be the column names and each item in the dict should be a list of length 2. Items in the lists should be ints or floats or None. The first item in the list is the minimum capping value and the second item in the list is the maximum capping value. If None is supplied for either value then that capping will not take place for that particular column. The two items in a list cannot both be None. Either one of capping_values or quantiles must be supplied.

  • quantiles (dict or None, default = None) – Dictionary of quantiles in the range [0, 1] to set capping values at for each column. The keys in the dict should be the column names and each item in the dict should be a list of length 2. Items in the lists should be ints or floats or None. The first item in the list is the lower quantile and the second item is the upper quantile to set the capping value from. The fit method calculates the corresponding quantile values from the input data X. If None is supplied for either value then that capping will not take place for that particular column. The two items in a list cannot both be None. Either one of capping_values or quantiles must be supplied.

  • weights_column (str or None, default = None) – Optional weights column argument that can be used in combination with quantiles. Not used if capping_values is supplied. Allows weighted quantiles to be calculated.

  • **kwargs – Arbitrary keyword arguments passed on to the BaseTransformer.__init__ method.
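The [minimum, maximum] convention described for capping_values can be illustrated with a minimal pure-Python sketch. It is not the transformer's implementation: cap_column is a hypothetical helper, plain lists stand in for dataframe columns, and the ValueError raised when both caps are None is an assumption about how such a rule might be enforced.

```python
def cap_column(values, cap_min=None, cap_max=None):
    """Cap a list of numbers at an optional minimum and/or maximum."""
    if cap_min is None and cap_max is None:
        # Mirrors the documented rule that both entries cannot be None.
        # The exception type is an assumption, not taken from the library.
        raise ValueError("at least one of cap_min and cap_max must be set")
    capped = []
    for v in values:
        if cap_max is not None and v > cap_max:
            v = cap_max  # max capping: values above the cap become the cap
        if cap_min is not None and v < cap_min:
            v = cap_min  # min capping: values below the cap become the cap
        capped.append(v)
    return capped


# Keys are column names; each item is [minimum cap, maximum cap].
# None in either position switches that side of the capping off.
capping_values = {"a": [2, 10], "b": [None, 5]}

# Plain lists stand in for dataframe columns in this sketch.
data = {"a": [1, 4, 12], "b": [-3, 5, 9]}

capped = {
    col: cap_column(data[col], cap_min=lo, cap_max=hi)
    for col, (lo, hi) in capping_values.items()
}
print(capped)  # {'a': [2, 4, 10], 'b': [-3, 5, 5]}
```

Column "b" has no minimum cap, so its negative value passes through unchanged while values above 5 are capped.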

capping_values

Capping values to apply to each column, capping_values argument.

Type

dict or None

quantiles

Quantiles to set capping values at from input data. Will be empty after init, values populated when fit is run.

Type

dict or None

weights_column

weights_column argument.

Type

str or None

_replacement_values

Replacement values when capping is applied. Will be a copy of capping_values.

Type

dict

__init__(capping_values=None, quantiles=None, weights_column=None, **kwargs) → None[source]

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__([capping_values, quantiles, …])

Initialize self.

check_capping_values_dict(…)

Performs checks on a dictionary passed to .

check_is_fitted(attribute)

Check if particular attributes are on the object.

check_weights_column(X, weights_column)

Helper method for validating weights column in dataframe.

classname()

Method that returns the name of the current class when called.

columns_check(X)

Method to check that the columns attribute is set and all values are present in X.

columns_set_or_check(X)

Function to check or set columns attribute.

fit(X[, y])

Learn capping values from input data X.

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

prepare_quantiles(values, quantiles[, …])

Method to call the weighted_quantile method and prepare the outputs.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Apply capping to columns in X.

weighted_quantile(values, quantiles[, …])

Method to calculate weighted quantiles.

check_capping_values_dict(capping_values_dict, dict_name)[source]

Performs checks on a dictionary passed to .

check_is_fitted(attribute)

Check if particular attributes are on the object. This is useful to do before running transform to avoid trying to transform data without first running the fit method.

Wrapper for utils.validation.check_is_fitted function.

Parameters

attribute (List) – List of str values giving the names of the attributes to check for on self.

static check_weights_column(X, weights_column)

Helper method for validating weights column in dataframe.

Parameters
  • X (pd.DataFrame) – Dataframe containing the weights column.

  • weights_column (str) – Name of the weights column.

classname()

Method that returns the name of the current class when called.

columns_check(X)

Method to check that the columns attribute is set and all values are present in X.

Parameters

X (pd.DataFrame) – Data to check columns are in.

columns_set_or_check(X)

Function to check or set columns attribute.

If the columns attribute is None then set it to all columns in X. Otherwise run the columns_check method.

Parameters

X (pd.DataFrame) – Data to check columns are in.

fit(X, y=None)[source]

Learn capping values from input data X.

Calculates the quantiles to cap at given the quantiles dictionary supplied when initialising the transformer. Saves learnt values in the capping_values attribute.

Parameters
  • X (pd.DataFrame) – A dataframe with required columns to be capped.

  • y (None) – Required for pipeline.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray of shape (n_samples, n_features_new)

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

prepare_quantiles(values, quantiles, sample_weight=None)[source]

Method to call the weighted_quantile method and prepare the outputs.

If there are no None values in the supplied quantiles then the outputs from weighted_quantile are returned as is. If there are then prepare_quantiles removes the None values before calling weighted_quantile and adds them back into the output, in the same position, after calling.
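The None handling described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: prepare_quantiles_sketch and nearest_rank are hypothetical names, nearest_rank is a simple stand-in for weighted_quantile, and sample weights are left out to keep the example short.

```python
def prepare_quantiles_sketch(values, quantiles, quantile_fn):
    """Compute quantiles while keeping None placeholders in position."""
    requested = [q for q in quantiles if q is not None]
    if len(requested) == len(quantiles):
        # No None entries: return the computed quantiles as they are.
        return list(quantile_fn(values, requested))
    # Compute only the non-None quantiles, then re-insert None
    # at the positions where it appeared in the input.
    computed = iter(quantile_fn(values, requested))
    return [None if q is None else next(computed) for q in quantiles]


def nearest_rank(values, quantiles):
    """Hypothetical stand-in for weighted_quantile (nearest ranked value)."""
    ordered = sorted(values)
    return [ordered[round(q * (len(ordered) - 1))] for q in quantiles]


values = [10, 20, 30, 40, 50]
print(prepare_quantiles_sketch(values, [None, 0.75], nearest_rank))  # [None, 40]
print(prepare_quantiles_sketch(values, [0.25, None], nearest_rank))  # [20, None]
print(prepare_quantiles_sketch(values, [0, 1], nearest_rank))        # [10, 50]
```

Keeping None in its original position means the output can still be read as a [lower, upper] pair, with None marking the side on which no capping is applied.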

Parameters
  • values (pd.Series or np.array) – A dataframe column with values to calculate quantiles from.

  • quantiles (list) – Weighted quantiles to calculate. Must all be between 0 and 1.

  • sample_weight (pd.Series or np.array or None, default = None) – Sample weights for each item in values, must be the same length as values. If not supplied then unit weights will be used.

Returns

interp_quantiles – List containing computed quantiles.

Return type

list

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

transform(X)[source]

Apply capping to columns in X.

If cap_value_max is set, any values above cap_value_max will be set to cap_value_max. If cap_value_min is set, any values below cap_value_min will be set to cap_value_min. Only works for numeric columns.

Parameters

X (pd.DataFrame) – Data to apply capping to.

Returns

X – Transformed input X with min and max capping applied to the specified columns.

Return type

pd.DataFrame

weighted_quantile(values, quantiles, sample_weight=None)[source]

Method to calculate weighted quantiles.

This method is adapted from the “Completely vectorized numpy solution” answer from user Alleo (https://stackoverflow.com/users/498892/alleo) to the following Stack Overflow question: https://stackoverflow.com/questions/21844024/weighted-percentile-using-numpy. This method is also licenced under the CC-BY-SA terms, as was the original code sample posted to Stack Overflow (pre February 1, 2016).

Method is similar to numpy.percentile, but supports weights. Supplied quantiles should be in the range [0, 1]. Method calculates cumulative % of weight for each observation, then interpolates between these observations to calculate the desired quantiles. Null values in the observations (values) and 0 weight observations are filtered out before calculating.
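The procedure in the paragraph above (filter, accumulate the weight fractions, interpolate) can be sketched in pure Python. This is a hedged illustration, not the library's implementation: weighted_quantile_sketch and interpolate are hypothetical names, the sketch assumes non-negative weights with at least one positive weight, and it clamps quantiles that fall outside the cumulative range to the smallest or largest value.

```python
import bisect
import math


def interpolate(q, xs, ys):
    """Linear interpolation of q over points (xs, ys), clamped at both ends."""
    if q <= xs[0]:
        return ys[0]
    if q >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, q)  # xs[i - 1] <= q < xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (q - x0) / (x1 - x0)


def weighted_quantile_sketch(values, quantiles, sample_weight=None):
    """Weighted quantiles via cumulative weight fractions and interpolation."""
    if sample_weight is None:
        sample_weight = [1.0] * len(values)  # unit weights by default
    # Drop null values and zero-weight observations, then sort by value.
    pairs = sorted(
        (float(v), float(w))
        for v, w in zip(values, sample_weight)
        if v is not None and not math.isnan(v) and w != 0
    )
    total = sum(w for _, w in pairs)
    # Cumulative fraction of total weight reached at each observation.
    cumulative, running = [], 0.0
    for _, w in pairs:
        running += w
        cumulative.append(running / total)
    ordered_values = [v for v, _ in pairs]
    return [interpolate(q, cumulative, ordered_values) for q in quantiles]


result = weighted_quantile_sketch(
    values=[1, 2, 3, 4, 5],
    quantiles=[0, 0.5, 1.0],
    sample_weight=[1, 0, 1, 0, 1],
)
print([round(q, 1) for q in result])  # [1.0, 2.0, 5.0]
```

With weights [1, 0, 1, 0, 1] only the values 1, 3 and 5 remain, at cumulative fractions 1/3, 2/3 and 1, so the 0.5 quantile falls halfway between 1 and 3.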

Parameters
  • values (pd.Series or np.array) – A dataframe column with values to calculate quantiles from.

  • quantiles (list) – Weighted quantiles to calculate. Must all be between 0 and 1.

  • sample_weight (pd.Series or np.array or None, default = None) – Sample weights for each item in values, must be the same length as values. If not supplied then unit weights will be used.

Returns

interp_quantiles – List containing computed quantiles.

Return type

list

Examples

>>> x = CappingTransformer(capping_values={"a": [2, 10]})
>>> quantiles_to_compute = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
>>> computed_quantiles = x.weighted_quantile(values = [1, 2, 3], sample_weight = [1, 1, 1], quantiles = quantiles_to_compute)
>>> [round(q, 1) for q in computed_quantiles]
[1.0, 1.0, 1.0, 1.0, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3.0]
>>>
>>> computed_quantiles = x.weighted_quantile(values = [1, 2, 3], sample_weight = [0, 1, 0], quantiles = quantiles_to_compute)
>>> [round(q, 1) for q in computed_quantiles]
[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
>>>
>>> computed_quantiles = x.weighted_quantile(values = [1, 2, 3], sample_weight = [1, 1, 0], quantiles = quantiles_to_compute)
>>> [round(q, 1) for q in computed_quantiles]
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
>>>
>>> computed_quantiles = x.weighted_quantile(values = [1, 2, 3, 4, 5], sample_weight = [1, 1, 1, 1, 1], quantiles = quantiles_to_compute)
>>> [round(q, 1) for q in computed_quantiles]
[1.0, 1.0, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
>>>
>>> computed_quantiles = x.weighted_quantile(values = [1, 2, 3, 4, 5], sample_weight = [1, 0, 1, 0, 1], quantiles = [0, 0.5, 1.0])
>>> [round(q, 1) for q in computed_quantiles]
[1.0, 2.0, 5.0]