tubular.nominal.MeanResponseTransformer

class tubular.nominal.MeanResponseTransformer(columns=None, weights_column=None, prior=0, level=None, unseen_level_handling=None, **kwargs)[source]

Bases: tubular.nominal.BaseNominalTransformer, tubular.mapping.BaseMappingTransformMixin

Transformer to apply mean response encoding. This converts categorical variables to numeric by mapping levels to the mean response for that level.

For a continuous or binary response the categorical columns specified will have values replaced with the mean response for each category.
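As a rough illustration of the idea described above, the sketch below computes a mean response mapping in plain Python and applies it to a column. The function name mean_response_mapping and the toy data are hypothetical; this is not tubular's implementation, only a dependency-free sketch of the technique.

```python
from collections import defaultdict


def mean_response_mapping(levels, response):
    """Return {level: mean of the response over rows with that level}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, response):
        totals[level] += y
        counts[level] += 1
    return {level: totals[level] / counts[level] for level in totals}


# Toy data (hypothetical): one categorical column and a binary response.
levels = ["a", "a", "b", "b", "b"]
response = [1, 0, 1, 1, 0]

mapping = mean_response_mapping(levels, response)

# Replace each level with its mean response.
encoded = [mapping[level] for level in levels]
```

Here level "a" is mapped to 0.5 (one positive out of two rows) and level "b" to two thirds.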

For an n > 1 level categorical response, up to n binary responses can be created, each of which can then be used to encode every categorical column specified. This generates up to n * len(columns) new columns, with names of the form {column}_{response_level}. The original columns are removed from the dataframe. This functionality is controlled by the ‘level’ parameter. Note that it only applies to an n > 1 level categorical response. Do not use the ‘level’ parameter for an n > 1 level numerical response; in that case, use the standard mean response transformer without the ‘level’ parameter.
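The multi-level behaviour can be sketched as follows: each response level is turned into a 0/1 indicator, and every categorical column is encoded against each indicator in turn. The helper names (mean_by_level, encode_multi_level) and the data are hypothetical, and the sketch assumes the level='all' case; it is not tubular's code.

```python
from collections import defaultdict


def mean_by_level(feature_values, target):
    """Mean of target for each distinct value in feature_values."""
    totals, counts = defaultdict(float), defaultdict(int)
    for value, y in zip(feature_values, target):
        totals[value] += y
        counts[value] += 1
    return {value: totals[value] / counts[value] for value in totals}


def encode_multi_level(features, response, response_levels):
    """Encode every column against every response level."""
    encoded = {}
    for column, values in features.items():
        for level in response_levels:
            # Binary response: 1 where the response equals this level.
            indicator = [1 if y == level else 0 for y in response]
            mapping = mean_by_level(values, indicator)
            # New column named {column}_{response_level}.
            encoded[f"{column}_{level}"] = [mapping[v] for v in values]
    return encoded


features = {"colour": ["red", "red", "blue", "blue"]}
response = ["cat", "dog", "cat", "cat"]

# Equivalent of level='all': encode against every level present.
response_levels = sorted(set(response))
encoded = encode_multi_level(features, response, response_levels)
```

With one column and two response levels, the result holds two new columns, colour_cat and colour_dog, in place of the original colour column.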

If a categorical variable contains null values these will not be transformed.

The same weights and prior are applied to each response level in the multi-level case.

Parameters
  • columns (None or str or list, default = None) – Columns to transform. If the default of None is supplied, all object and category columns in X are used.

  • weights_column (str or None) – Weights column to use when calculating the mean response.

  • prior (int, default = 0) – Regularisation parameter, can be thought of roughly as the size a category should be in order for its statistics to be considered reliable (hence default value of 0 means no regularisation).

  • level (str, list or None, default = None) – Parameter to control encoding against a multi-level categorical response. For a continuous or binary response, leave this as None. In the multi-level case, set to ‘all’ to encode against every response level or provide a list of response levels to encode against.

  • unseen_level_handling (str ("Mean", "Median", "Lowest" or "Highest") or int/float, default = None) – Controls how unseen levels of the categorical features are handled in data passed to the transform method. The default value of None raises an error when transform is called on data with unseen levels in the categorical columns to encode. Set this parameter to one of the options above to encode unseen levels in each categorical column with the mean, median etc. of that column. An arbitrary int/float value can also be passed to use for encoding unseen levels.

  • **kwargs – Arbitrary keyword arguments passed on to the BaseTransformer.__init__ method.
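To make the role of prior more concrete, the sketch below uses one common form of this kind of regularisation: the level mean is blended with the global mean, with prior acting as a pseudo-count of rows at the global mean. The formula and the name regularised_mean are assumptions for illustration; the text above does not state the exact formula tubular uses.

```python
def regularised_mean(level_sum, level_size, global_mean, prior=0):
    """Blend a level's mean with the global mean.

    Assumed formula (not confirmed by the documentation above):
    (level_sum + prior * global_mean) / (level_size + prior)
    """
    return (level_sum + prior * global_mean) / (level_size + prior)


# A small level: 2 rows, both with response 1; global mean is 0.5.
no_prior = regularised_mean(2.0, 2, 0.5, prior=0)    # plain level mean
with_prior = regularised_mean(2.0, 2, 0.5, prior=2)  # pulled towards 0.5

# A large level with the same raw mean is barely affected by the prior.
large_level = regularised_mean(200.0, 200, 0.5, prior=2)
```

Under this assumed formula, prior=0 returns the plain level mean, and a level with about prior rows sits halfway between its own mean and the global mean.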

columns

Categorical columns to encode in the input data.

Type

str or list

weights_column

Weights column to use when calculating the mean response.

Type

str or None

prior

Regularisation parameter, can be thought of roughly as the size a category should be in order for its statistics to be considered reliable (hence default value of 0 means no regularisation).

Type

int, default = 0

level

Parameter to control encoding against a multi-level categorical response. If None, the response is treated as binary or continuous; if ‘all’, every response level is encoded against; if a list of levels, only the levels specified are encoded against.

Type

str, list or None, default = None

response_levels

Only created in the multi-level case. Generated from level: a list of all the response levels to encode against.

Type

list

mappings

Created in fit. Dict of key (column names) value (mapping of categorical levels to numeric, mean response values) pairs.

Type

dict

mapped_columns

Only created in the multi-level case. A list of the new columns produced by encoding the columns in self.columns against multiple response levels, of the form {column}_{level}.

Type

list

transformer_dict

Only created in the multi-level case. A dictionary of the form level : transformer containing the mean response transformers for each level to be encoded against.

Type

dict

unseen_levels_encoding_dict

Dict containing the values (based on chosen unseen_level_handling) derived from the encoded columns to use when handling unseen levels in data passed to transform method.

Type

dict
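The unseen-level options can be sketched as below. The sketch assumes the statistic is taken over the encoded training column, as the description of unseen_levels_encoding_dict suggests; the function name unseen_level_value and the data are hypothetical, and this is not tubular's implementation.

```python
from statistics import mean, median

# Maps each string option to the statistic it selects.
_STATISTICS = {"Mean": mean, "Median": median, "Lowest": min, "Highest": max}


def unseen_level_value(encoded_column, unseen_level_handling):
    """Value used to encode levels that were not seen when fitting."""
    if isinstance(unseen_level_handling, (int, float)):
        # An arbitrary number is used as-is.
        return unseen_level_handling
    return _STATISTICS[unseen_level_handling](encoded_column)


# Mapping learnt in fit, and the training column after encoding.
mapping = {"a": 0.5, "b": 1.0}
train_levels = ["a", "a", "b", "b", "b"]
encoded_train = [mapping[level] for level in train_levels]

fill = unseen_level_value(encoded_train, "Median")

# "z" was not present when fitting, so it receives the fill value.
new_levels = ["a", "z"]
encoded_new = [mapping.get(level, fill) for level in new_levels]
```

A dict of such fill values, keyed by column name, is one plausible shape for unseen_levels_encoding_dict.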

__init__(columns=None, weights_column=None, prior=0, level=None, unseen_level_handling=None, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__([columns, weights_column, prior, …])

Initialize self.

check_is_fitted(attribute)

Check if particular attributes are on the object.

check_mappable_rows(X)

Method to check that all the rows to apply the transformer to are able to be mapped according to the values in the mappings dict.

check_weights_column(X, weights_column)

Helper method for validating weights column in dataframe.

classname()

Method that returns the name of the current class when called.

columns_check(X)

Method to check that the columns attribute is set and all values are present in X.

columns_set_or_check(X)

Function to check or set columns attribute.

fit(X, y)

Identify mapping of categorical levels to mean response values.

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform method to apply mean response encoding stored in the mappings attribute to each column in the columns attribute.

check_is_fitted(attribute)

Check if particular attributes are on the object. This is useful to do before running transform to avoid trying to transform data without first running the fit method.

Wrapper for utils.validation.check_is_fitted function.

Parameters

attribute (List) – List of str values giving the names of the attributes to check exist on self.

check_mappable_rows(X)

Method to check that all the rows to apply the transformer to are able to be mapped according to the values in the mappings dict.

Raises

ValueError – If any of the rows in a column (c) to be mapped, could not be mapped according to the mapping dict in mappings[c].
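The check described above can be sketched for a single column as follows. The standalone function check_rows_mappable is hypothetical and only mirrors the documented behaviour (raise ValueError if any row has no entry in the mapping); how the real method treats null values is not covered here.

```python
def check_rows_mappable(column_values, mapping, column_name):
    """Raise ValueError if any value has no entry in the mapping."""
    unmappable = sorted({v for v in column_values if v not in mapping})
    if unmappable:
        raise ValueError(
            f"column {column_name!r} contains values that cannot be "
            f"mapped: {unmappable}"
        )


mapping = {"a": 0.5, "b": 1.0}

# All values have a mapping, so this passes silently.
check_rows_mappable(["a", "b", "a"], mapping, "colour")

# "z" has no mapping, so this raises ValueError.
try:
    check_rows_mappable(["a", "z"], mapping, "colour")
    raised = False
except ValueError:
    raised = True
```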

static check_weights_column(X, weights_column)

Helper method for validating weights column in dataframe.

Parameters
  • X (pd.DataFrame) – DataFrame containing the weights column.

  • weights_column (str) – Name of the weights column.

classname()

Method that returns the name of the current class when called.

columns_check(X)

Method to check that the columns attribute is set and all values are present in X.

Parameters

X (pd.DataFrame) – Data to check columns are in.

columns_set_or_check(X)

Function to check or set columns attribute.

If the columns attribute is None then set it to all object and category columns in X. Otherwise run the columns_check method.

Parameters

X (pd.DataFrame) – Data to check columns are in.

fit(X, y)[source]

Identify mapping of categorical levels to mean response values.

If the user specified the weights_column argument when initialising the transformer, the weighted mean response is calculated using that column.

In the multi-level case this method learns which response levels are present and are to be encoded against.

Parameters
  • X (pd.DataFrame) – Data with categorical variable columns to transform (the response is passed separately as y).

  • y (pd.Series) – Response variable or target.
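The weighted case mentioned above can be sketched as follows: each row contributes weight * response to its level's total, and the total is divided by the sum of weights for that level. The function name weighted_mean_response and the data are hypothetical; this is a plain-Python sketch, not tubular's fit method.

```python
from collections import defaultdict


def weighted_mean_response(levels, response, weights):
    """Return {level: weighted mean of the response for that level}."""
    weighted_sums = defaultdict(float)
    weight_totals = defaultdict(float)
    for level, y, w in zip(levels, response, weights):
        weighted_sums[level] += w * y
        weight_totals[level] += w
    return {
        level: weighted_sums[level] / weight_totals[level]
        for level in weighted_sums
    }


levels = ["a", "a", "b"]
response = [1, 0, 1]
weights = [3, 1, 2]

mapping = weighted_mean_response(levels, response, weights)
```

With these weights, level "a" maps to 0.75 rather than the unweighted 0.5, because the row with response 1 carries three times the weight of the row with response 0.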

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

transform(X)[source]

Transform method to apply mean response encoding stored in the mappings attribute to each column in the columns attribute.

This method calls the check_mappable_rows method from BaseNominalTransformer to check that all rows can be mapped, then calls transform from BaseMappingTransformMixin to apply the standard pd.Series.map method.

N.B. In the multi-level case, this method briefly overwrites the self.columns attribute, but sets it back to the original value at the end.

Parameters

X (pd.DataFrame) – Data with nominal columns to transform.

Returns

X – Transformed input X with levels mapped according to the mappings dict.

Return type

pd.DataFrame