tubular.numeric.PCATransformer

class tubular.numeric.PCATransformer(columns, n_components=2, svd_solver='auto', random_state=None, pca_column_prefix='pca_', **kwargs)[source]

Bases: tubular.base.BaseTransformer

Transformer that generates new variables using principal component analysis (PCA): linear dimensionality reduction using singular value decomposition of the data to project it to a lower-dimensional space.

It is based on the sklearn class sklearn.decomposition.PCA

Parameters
  • columns (None or list or str) – Columns to apply the transformer to. If a str is passed it is put into a list. The value passed in columns is saved in the columns attribute on the object. Note this has no default value, so the user has to specify the columns when initialising the transformer. If the user forgets to set columns, all columns will be picked up when the super transform runs.

  • n_components (int, float or 'mle', default=2) –

    Number of components to keep. If n_components is not set, all components are kept:

    n_components == min(n_samples, n_features)

    If n_components == 'mle' and svd_solver == 'full', Minka's MLE is used to guess the dimension. Use of n_components == 'mle' will interpret svd_solver == 'auto' as svd_solver == 'full'.

    If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.

    If svd_solver == 'arpack', the number of components must be strictly less than the minimum of n_features and n_samples. Hence, the None case results in:

    n_components == min(n_samples, n_features) - 1

  • svd_solver ({'auto', 'full', 'arpack', 'randomized'}, default='auto') –

    If auto :

    The solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.

    If full :

    run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing

    If arpack :

    run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape)

    If randomized :

    run randomized SVD by the method of Halko et al.

  • random_state (int, RandomState instance or None, default=None) – Used when the 'arpack' or 'randomized' solvers are used. Pass an int for reproducible results across multiple function calls. Added in sklearn version 0.18.0.

  • pca_column_prefix (str, default='pca_') – Prefix added to each of the n_components features generated. For example, if n_components = 3, the new columns will be 'pca_0', 'pca_1', 'pca_2'.
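As a minimal sketch of what the transformer does with these parameters, the following uses sklearn.decomposition.PCA directly (the class this transformer wraps) and appends the resulting components to the input DataFrame with the default "pca_" prefix; the column names "a", "b", "c" are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

# Fit PCA on the selected columns and compute the projected components.
pca = PCA(n_components=2, svd_solver="auto", random_state=1)
components = pca.fit_transform(X[["a", "b", "c"]])

# One new column per component, named with the "pca_" prefix.
for i in range(components.shape[1]):
    X[f"pca_{i}"] = components[:, i]

print(list(X.columns))  # ['a', 'b', 'c', 'pca_0', 'pca_1']
```

Note that, like the transformer, this appends the PCA features to the original frame rather than replacing the input columns.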

pca
Type

PCA class from sklearn.decomposition

n_components_

The estimated number of components. When n_components is set to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this number is estimated from input data. Otherwise it equals the parameter n_components, or the lesser value of n_features and n_samples if n_components is None.

Type

int

feature_names_out

List of feature names representing the new dimensions.

Type

list or None
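To illustrate the n_components_ attribute being estimated rather than fixed, the sketch below (using sklearn's PCA directly as a stand-in) passes a variance fraction between 0 and 1 with svd_solver='full', so the number of kept components is chosen from the data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Keep just enough components to explain 90% of the variance;
# n_components_ then holds the estimated count.
pca = PCA(n_components=0.9, svd_solver="full").fit(X)
print(pca.n_components_)
```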

__init__(columns, n_components=2, svd_solver='auto', random_state=None, pca_column_prefix='pca_', **kwargs) → None[source]

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(columns[, n_components, …])

Initialize self.

check_is_fitted(attribute)

Check if particular attributes are on the object.

check_numeric_columns(X)

Method to check that all columns (specified in self.columns) in X are numeric.

check_weights_column(X, weights_column)

Helper method for validating weights column in dataframe.

classname()

Method that returns the name of the current class when called.

columns_check(X)

Method to check that the columns attribute is set and all values are present in X.

columns_set_or_check(X)

Function to check or set columns attribute.

fit(X[, y])

Fit PCA to input data.

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Generate PCA features from the input pandas DataFrame (X) and add these columns to X.

check_is_fitted(attribute)

Check if particular attributes are on the object. This is useful to do before running transform to avoid trying to transform data without first running the fit method.

Wrapper for utils.validation.check_is_fitted function.

Parameters

attribute (list of str) – Names of the attributes to check exist on self.
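Since this is documented as a wrapper for sklearn's utils.validation.check_is_fitted, its behaviour can be sketched with a plain sklearn PCA object as a stand-in: checking before fit raises NotFittedError, while checking after fit passes silently.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

pca = PCA(n_components=2)

# Before fit: the fitted attribute does not exist, so the check raises.
try:
    check_is_fitted(pca, ["n_components_"])
    fitted_before = True
except NotFittedError:
    fitted_before = False

pca.fit(np.random.default_rng(0).normal(size=(20, 3)))
check_is_fitted(pca, ["n_components_"])  # no exception after fit

print(fitted_before)  # False
```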

check_numeric_columns(X)[source]

Method to check that all columns (specified in self.columns) in X are numeric.

Parameters

X (pd.DataFrame) – Data containing columns to check.

static check_weights_column(X, weights_column)

Helper method for validating weights column in dataframe.

Parameters

  • X (pd.DataFrame) – df containing the weight column.

  • weights_column (str) – name of the weight column.

classname()

Method that returns the name of the current class when called.

columns_check(X)

Method to check that the columns attribute is set and all values are present in X.

Parameters

X (pd.DataFrame) – Data to check columns are in.

columns_set_or_check(X)

Function to check or set columns attribute.

If the columns attribute is None then set it to all columns in X. Otherwise run the columns_check method.

Parameters

X (pd.DataFrame) – Data to check columns are in.

fit(X, y=None)[source]

Fit PCA to input data.

Parameters
  • X (pd.DataFrame) – Dataframe with columns to learn scaling values from.

  • y (None) – Required for pipeline.
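What fit learns can be sketched with sklearn's PCA directly: the principal axes are stored on the fitted object as components_, with shape (n_components, n_features).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

# Fitting learns the principal axes from the 4 input features.
pca = PCA(n_components=2).fit(X)
print(pca.components_.shape)  # (2, 4)
```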

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance
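The nested <component>__<parameter> form mentioned above can be sketched with a sklearn Pipeline wrapping a PCA step as a stand-in estimator; the step name "pca" here is illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

pipe = Pipeline([("pca", PCA(n_components=2))])

# Update a parameter of the nested "pca" step via <component>__<parameter>.
pipe.set_params(pca__n_components=3)
print(pipe.get_params()["pca__n_components"])  # 3
```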

transform(X)[source]

Generate PCA features from the input pandas DataFrame (X) and add these columns to X.

Parameters

X (pd.DataFrame) – Data to transform.

Returns

X – Input X with additional column or columns added. These contain the PCA features generated from the columns specified in self.columns, named using pca_column_prefix.

Return type

pd.DataFrame