tubular.nominal.GroupRareLevelsTransformer

class tubular.nominal.GroupRareLevelsTransformer(columns=None, cut_off_percent=0.01, weight=None, rare_level_name='rare', record_rare_levels=True, **kwargs)[source]

Bases: tubular.nominal.BaseNominalTransformer

Transformer to group together rare levels of nominal variables into a new level, labelled ‘rare’ (by default).

Rare levels are defined by a cut off percentage, which can either be based on the number of rows or sum of weights. Any levels below this cut off value will be grouped into the rare level.
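The cut-off rule can be sketched in plain Python. This is a minimal illustration, not tubular's implementation: the helper name `rare_levels` and the sample data are hypothetical, and treating a level as rare when its share is strictly below the cut-off is an assumption based on the wording above.

```python
from collections import defaultdict


def rare_levels(values, cut_off_percent=0.01, weights=None):
    """Return the levels whose share of rows (or of total weight) is below the cut-off."""
    if weights is None:
        weights = [1.0] * len(values)  # unweighted case: every row counts once
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    grand_total = sum(totals.values())
    return {level for level, total in totals.items() if total / grand_total < cut_off_percent}


colours = ["red"] * 6 + ["blue"] * 3 + ["green"]

# By row count: red 60%, blue 30%, green 10%.
by_rows = rare_levels(colours, cut_off_percent=0.2)

# By weight: the single green row carries weight 10 out of a total of 19.
row_weights = [1.0] * 9 + [10.0]
by_weight = rare_levels(colours, cut_off_percent=0.2, weights=row_weights)

print(by_rows)    # {'green'}
print(by_weight)  # {'blue'}
```

Switching from row counts to weights can change which levels count as rare, which is why the weight argument matters when rows are not equally important.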

Parameters
  • columns (None or str or list, default = None) – Columns to transform. If the default of None is supplied, all object and category columns in X are used.

  • cut_off_percent (float, default = 0.01) – Cut off for the percent of rows or percent of weight for a level. Levels below this value will be grouped.

  • weight (None or str, default = None) – Name of weights column that should be used so cut_off_percent applies to sum of weights rather than number of rows.

  • rare_level_name (any, default = 'rare') – Label for the new ‘rare’ level. Must be of the same type as the columns being transformed.

  • record_rare_levels (bool, default = True) – If True, an attribute called rare_levels_record_ will be added to the object. This will be a dict mapping each column name to the levels from that column considered rare according to cut_off_percent. Care should be taken with nominal variables that have many levels, as this could result in many levels being stored in this attribute.

  • **kwargs – Arbitrary keyword arguments passed on to the BaseTransformer.__init__ method.

cut_off_percent

Cut off percentage (either in terms of number of rows or sum of weight) for a given nominal level to be considered rare.

Type

float

mapping_

Created in fit. A dict of non-rare levels (i.e. levels with more than cut_off_percent weight or rows) that is used to identify rare levels in transform.

Type

dict

rare_level_name

Must be of the same type as columns. Label for the new nominal level that will be added to group together rare levels (as defined by cut_off_percent).

Type

any

record_rare_levels

Should the ‘rare’ levels that will be grouped together be recorded? If not, they will be lost after fit and the only information remaining will be the ‘non-rare’ levels.

Type

bool

rare_levels_record_

Only created (in fit) if record_rare_levels is True. This is a dict containing, for each column the transformer was applied to, a list of the levels that were grouped into ‘rare’.

Type

dict

weight

Name of the weights column to use if cut_off_percent should be in terms of sum of weight rather than number of rows.

Type

str

__init__(columns=None, cut_off_percent=0.01, weight=None, rare_level_name='rare', record_rare_levels=True, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__([columns, cut_off_percent, weight, …])

Initialize self.

check_is_fitted(attribute)

Check if particular attributes are on the object.

check_mappable_rows(X)

Method to check that all the rows to apply the transformer to are able to be mapped according to the values in the mappings dict.

check_weights_column(X, weights_column)

Helper method for validating weights column in dataframe.

classname()

Method that returns the name of the current class when called.

columns_check(X)

Method to check that the columns attribute is set and all values are present in X.

columns_set_or_check(X)

Function to check or set columns attribute.

fit(X[, y])

Records non-rare levels for categorical variables.

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Groups rare levels together into a new ‘rare’ level.

check_is_fitted(attribute)

Check if particular attributes are on the object. This is useful to do before running transform to avoid trying to transform data without first running the fit method.

Wrapper for utils.validation.check_is_fitted function.

Parameters

attribute (list) – List of str values giving the names of the attributes to check for on self.

check_mappable_rows(X)

Method to check that all the rows to apply the transformer to are able to be mapped according to the values in the mappings dict.

Raises

ValueError – If any of the rows in a column (c) to be mapped could not be mapped according to the mapping dict in mappings[c].
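The check described above can be sketched in plain Python. This is an illustration under assumptions, not the library's code: here `data` is a hypothetical dict of column name to list of values standing in for the DataFrame, `mappings` is a hypothetical dict of per-column mapping dicts, and the error message wording is invented.

```python
def check_mappable_rows(data, mappings):
    """Raise ValueError if any value in a mapped column has no entry in mappings[column]."""
    for column, mapping in mappings.items():
        unmappable = [value for value in data[column] if value not in mapping]
        if unmappable:
            raise ValueError(f"column {column!r} has values that cannot be mapped: {unmappable}")


mappings = {"colour": {"red": 0, "blue": 1}}

# Every value has an entry in the mapping, so this call returns without error.
check_mappable_rows({"colour": ["red", "blue", "red"]}, mappings)

# 'green' has no entry in mappings['colour'], so a ValueError is raised.
try:
    check_mappable_rows({"colour": ["red", "green"]}, mappings)
except ValueError as error:
    message = str(error)
    print(message)
```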

static check_weights_column(X, weights_column)

Helper method for validating weights column in dataframe.

Parameters
  • X (pd.DataFrame) – DataFrame containing the weight column.

  • weights_column (str) – Name of the weight column.

classname()

Method that returns the name of the current class when called.

columns_check(X)

Method to check that the columns attribute is set and all values are present in X.

Parameters

X (pd.DataFrame) – Data to check columns are in.

columns_set_or_check(X)

Function to check or set columns attribute.

If the columns attribute is None then set it to all object and category columns in X. Otherwise run the columns_check method.

Parameters

X (pd.DataFrame) – Data to check columns are in.
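The set-or-check behaviour can be sketched as follows. This is a plain-Python stand-in rather than the library's method: the `dtypes` argument (a dict of column name to dtype name) is a hypothetical substitute for inspecting X, and raising ValueError for missing columns is an assumption.

```python
def columns_set_or_check(columns, dtypes):
    """Return the columns to transform, given a {column name: dtype name} dict."""
    if columns is None:
        # Default: pick up every object and category column.
        return [name for name, dtype in dtypes.items() if dtype in ("object", "category")]
    if isinstance(columns, str):
        columns = [columns]
    missing = [name for name in columns if name not in dtypes]
    if missing:
        raise ValueError(f"columns not found in X: {missing}")
    return list(columns)


dtypes = {"colour": "object", "size": "category", "price": "float64"}

default_columns = columns_set_or_check(None, dtypes)
chosen_columns = columns_set_or_check("colour", dtypes)

print(default_columns)  # ['colour', 'size']
print(chosen_columns)   # ['colour']
```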

fit(X, y=None)[source]

Records non-rare levels for categorical variables.

When transform is called, only levels recorded in mapping_ during fit will remain unchanged; all other levels will be grouped. If record_rare_levels is True then the rare levels will also be recorded.

The label for the rare levels must be of the same type as the columns.

Parameters
  • X (pd.DataFrame) – Data to identify non-rare levels from.

  • y (None or pd.DataFrame or pd.Series, default = None) – Optional argument only required for the transformer to work with sklearn pipelines.
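What fit records can be sketched in plain Python for the unweighted case. This is an illustration, not the library's code: `fit_levels` is a hypothetical helper, `data` is a dict of column name to list of values standing in for X, and both the shape of the returned mapping (column name to list of non-rare levels) and the handling of levels exactly at the cut-off are assumptions.

```python
from collections import Counter


def fit_levels(data, cut_off_percent=0.01, record_rare_levels=True):
    """Return (mapping, rare_record): per-column non-rare levels and, optionally, rare levels."""
    mapping = {}
    rare_record = {} if record_rare_levels else None
    for column, values in data.items():
        counts = Counter(values)
        n_rows = len(values)
        # Assumption: a level exactly at the cut-off is kept as non-rare.
        mapping[column] = [lvl for lvl, n in counts.items() if n / n_rows >= cut_off_percent]
        if record_rare_levels:
            rare_record[column] = [lvl for lvl, n in counts.items() if n / n_rows < cut_off_percent]
    return mapping, rare_record


data = {
    "colour": ["red", "red", "red", "blue", "green"],
    "size": ["S", "S", "M", "M", "XL"],
}
mapping, rare_record = fit_levels(data, cut_off_percent=0.3)

print(mapping)      # {'colour': ['red'], 'size': ['S', 'M']}
print(rare_record)  # {'colour': ['blue', 'green'], 'size': ['XL']}
```

With `record_rare_levels=False` the second return value is None, mirroring the point that the rare levels are lost unless they are explicitly recorded.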

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance
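The `<component>__<parameter>` naming can be illustrated with a small sketch. The helper `split_param` and the step name `grouper` are hypothetical; the sketch only shows how such a name separates into its two parts and is not the estimator's own code.

```python
def split_param(name):
    """Split 'component__parameter' at the first double underscore."""
    component, delimiter, parameter = name.partition("__")
    if not delimiter:
        return None, name  # plain parameter on the estimator itself
    return component, parameter


print(split_param("grouper__cut_off_percent"))  # ('grouper', 'cut_off_percent')
print(split_param("rare_level_name"))           # (None, 'rare_level_name')
```

Splitting at the first double underscore leaves any deeper nesting in the second part, so the remainder can be passed down to the named component unchanged.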

transform(X)[source]

Groups rare levels together into a new ‘rare’ level.

Parameters

X (pd.DataFrame) – Data with categorical variables to apply rare level grouping to.

Returns

X – Transformed input X with rare levels grouped into a new rare level.

Return type

pd.DataFrame
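The grouping step for a single column can be sketched in plain Python. This is an illustration rather than the library's implementation: `group_rare` is a hypothetical helper that works on a list of values, and `non_rare_levels` stands in for the levels recorded for that column during fit.

```python
def group_rare(values, non_rare_levels, rare_level_name="rare"):
    """Replace every value that is not a recorded non-rare level with the rare label."""
    keep = set(non_rare_levels)
    return [value if value in keep else rare_level_name for value in values]


grouped = group_rare(["red", "blue", "red", "purple"], non_rare_levels=["red"])
print(grouped)  # ['red', 'rare', 'red', 'rare']
```

Because only the recorded non-rare levels are left unchanged, a level that never appeared during fit ('purple' here) is grouped into the rare level as well.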