tubular package
Submodules
tubular.base module
Contains transformers that other transformers in the package inherit from.
These transformers contain key checks to be applied in all cases.
- class tubular.base.BaseTransformer(columns: ]] | str, copy: bool = False, verbose: bool = False, return_native: bool = True)[source]
Bases:
BaseEstimator, TransformerMixin
Base transformer class which all other transformers in the package inherit from.
Provides fit and transform methods (required by sklearn transformers), simple input checking and functionality to copy X prior to transform.
Attributes:
- columns: list
List of str values giving which columns in an input pandas.DataFrame the transformer will be applied to.
- copy: bool
Should X be copied before transforms are applied? The copy argument is no longer used and will be deprecated in a future release.
- verbose: bool
Whether to print statements showing which methods are being run.
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- return_native: bool, default = True
Controls whether transformer returns narwhals or native pandas/polars type
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> BaseTransformer(
...     columns="a",
... )
BaseTransformer(columns=['a'])
```
- FITS = True
- check_is_fitted(attribute: str) None[source]
Check if particular attributes are on the object.
This is useful to do before running transform to avoid trying to transform data without first running the fit method.
Wrapper for utils.validation.check_is_fitted function.
- Parameters:
attribute (List) – List of str values giving the names of attributes that must exist on self.
Example
```pycon
>>> transformer = BaseTransformer(
...     columns="a",
... )
>>> transformer.check_is_fitted("columns")
```
- classname() str[source]
Return the name of the current class when called.
- Returns:
name of class
- Return type:
str
- columns_check(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) None[source]
Check that the columns attribute is set and all values are present in X.
- Parameters:
X (DataFrame) – Data to check columns are in.
- Raises:
ValueError – if columns are missing from the dataframe
Examples
```pycon
>>> import polars as pl
>>> transformer = BaseTransformer(
...     columns="a",
... )
>>> df = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
>>> transformer.columns_check(df)
```
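The check described for columns_check can be pictured with a small stand-alone function. This is only a sketch of the documented behaviour (raise ValueError when a requested column is absent); the function below and its error message are hypothetical and are not tubular's implementation.

```python
from collections.abc import Iterable


def columns_check(columns: Iterable[str], frame_columns: Iterable[str]) -> None:
    """Raise ValueError when a requested column is absent (illustrative only)."""
    available = set(frame_columns)
    missing = [name for name in columns if name not in available]
    if missing:
        # The message text is an assumption, not tubular's actual wording.
        raise ValueError(f"columns {missing} are not in X")


# Every requested column is present, so nothing is raised.
columns_check(["a"], ["a", "b"])
```

Calling the sketch with a column that is not in the frame, for example `columns_check(["z"], ["a", "b"])`, raises ValueError.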
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame | None = None) BaseTransformer[source]
Check data before fit.
Fit calls the columns_check method which will check that the columns attribute is set and all values are present in X
- Parameters:
X (DataFrame) – Data to fit the transformer on.
y (None or Series or LazyFrame, default = None) – Optional argument only required for the transformer to work with sklearn pipelines.
- Returns:
returns self
- Return type:
BaseTransformer
Examples
```pycon
>>> import polars as pl
>>> transformer = BaseTransformer(
...     columns="a",
... )
>>> df = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
>>> transformer.fit(df)
BaseTransformer(columns=['a'])
```
- classmethod from_json(json: dict[str, Any]) BaseTransformer[source]
Rebuild transformer from json dict, ready for transform.
- Parameters:
json (dict[str, dict[str, Any]]) – json-ified transformer
- Returns:
reconstructed transformer class, ready for transform
- Return type:
BaseTransformer
- Raises:
RuntimeError – if transformer does not have to/from json functionality enabled
Examples
```pycon
>>> json_dict = {"init": {"columns": ["a", "b"]}, "fit": {}}
>>> BaseTransformer.from_json(json=json_dict)
BaseTransformer(columns=['a', 'b'])
```
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
Child classes will need to overload this method if their behaviour is more complex than just returning the input columns.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> transformer = BaseTransformer(
...     columns="a",
... )
>>> transformer.get_feature_names_out()
['a']
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- set_transform_request(*, return_native_override: bool | None | str = '$UNCHANGED$') BaseTransformer
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
return_native_override (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for return_native_override parameter in transform.
- Returns:
self – The updated object.
- Return type:
object
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
- Raises:
RuntimeError – if transformer does not have to/from json functionality enabled
Examples
```pycon
>>> transformer = BaseTransformer(columns=["a", "b"])
>>> # version will vary for local vs CI, so use ... as generic match
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'BaseTransformer', 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False, 'return_native': True}, 'fit': {'is_fitted_': False}}
```
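The to_json / from_json pair describes a round trip: init arguments are dumped into a nested dict and later fed back to the constructor. A minimal sketch of that idea follows, using a hypothetical ToyTransformer class rather than tubular's BaseTransformer; the real dict also carries keys such as 'tubular_version' that are omitted here.

```python
from typing import Any


class ToyTransformer:
    """Hypothetical stand-in used only to illustrate the round trip."""

    def __init__(self, columns: "list[str] | str") -> None:
        # A single column name is wrapped into a list, mirroring the documented repr.
        self.columns = [columns] if isinstance(columns, str) else list(columns)

    def to_json(self) -> dict[str, Any]:
        # Nested dict with one level for init arguments and one for fitted state.
        return {
            "classname": type(self).__name__,
            "init": {"columns": self.columns},
            "fit": {},
        }

    @classmethod
    def from_json(cls, json: dict[str, Any]) -> "ToyTransformer":
        # Rebuild the object by passing the stored init arguments back in.
        return cls(**json["init"])


original = ToyTransformer(columns="a")
rebuilt = ToyTransformer.from_json(original.to_json())
```

Here `rebuilt.columns` equals `original.columns`, which is the property the documented round trip relies on.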
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, return_native_override: bool | None = None) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Check data before child transform.
Transform calls the columns_check method which will check columns in columns attribute are in X.
- Parameters:
X (DataFrame) – Data to transform with the transformer.
return_native_override (Optional[bool]) – option to override return_native attr in transformer, useful when calling parent methods
- Returns:
X – Input X, copied if specified by user.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> transformer = BaseTransformer(
...     columns="a",
... )
>>> df = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
>>> transformer.transform(df)
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘
```
- class tubular.base.DataFrameMethodTransformer(**kwargs)[source]
Bases:
DropOriginalMixin, BaseTransformer
Transformer that applies a pandas.DataFrame method.
Transformer assigns the output of the method to a new column or columns. It is possible to supply other key word arguments to the transform method, which will be passed to the pandas.DataFrame method being called.
Be aware it is possible to supply incompatible arguments to init that will only be identified when transform is run. This is because there are many combinations of method, input and output sizes. Additionally some methods may only work as expected when called in transform with specific key word arguments.
- new_column_names
The name of the column or columns to be assigned to the output of running the pandas method in transform.
- Type:
str or list of str
- pd_method_name
The name of the pandas.DataFrame method to call.
- Type:
str
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = False
- deprecated = True
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
- transform(X: DataFrame) DataFrame[source]
Transform input data.
Uses the given pandas.DataFrame method and assigns the output back to a column or columns in X.
Any keyword arguments set in the pd_method_kwargs attribute are passed onto the pandas DataFrame method when calling it.
- Parameters:
X (pd.DataFrame) – Data to transform.
- Returns:
X – Input X with additional column or columns (self.new_column_names) added. These contain the output of running the pandas DataFrame method.
- Return type:
pd.DataFrame
- tubular.base.register(cls: BaseTransformer) BaseTransformer[source]
Add transformer to registry dict.
Returns:
cls - transformer
Example:
```pycon
>>> @register
... class MyTransformer(BaseTransformer):
...     pass
...
>>> CLASS_REGISTRY["MyTransformer"]
<class 'tubular.base.MyTransformer'>
```
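The registry pattern behind register can be sketched in a few lines: a decorator stores the class under its name and returns it unchanged. The snippet below is a generic illustration of that pattern with its own local CLASS_REGISTRY dict; it is an assumption about the general shape, not a copy of tubular's code.

```python
# Local registry for the sketch; tubular keeps its own CLASS_REGISTRY.
CLASS_REGISTRY: dict[str, type] = {}


def register(cls: type) -> type:
    """Store the class under its name and hand it back unchanged."""
    CLASS_REGISTRY[cls.__name__] = cls
    return cls


@register
class MyTransformer:
    pass
```

Because the decorator returns the class itself, `CLASS_REGISTRY["MyTransformer"]` and `MyTransformer` refer to the same object.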
tubular.capping module
Contains transformers that apply capping to numeric columns.
- class tubular.capping.BaseCappingTransformer(capping_values: dict[str, ~typing.Annotated[list[int | float | None], beartype.vale.Is[lambda list_arg: ...]]] | None = None, quantiles: dict[str, ~typing.Annotated[list[int | float | None], beartype.vale.Is[lambda list_arg: ...]]] | None = None, weights_column: str | None = None, **kwargs: bool)[source]
Bases:
BaseNumericTransformer, WeightColumnMixin
Base class for capping transformers, contains functionality shared across capping transformer classes.
- capping_values
Capping values to apply to each column, capping_values argument.
- Type:
dict[str, CappingValues] or None
- quantiles
Quantiles to set capping values at from input data. Will be empty after init, values populated when fit is run.
- Type:
dict[str, CappingValues] or None
- quantile_capping_values
Capping values learned from quantiles (if provided) to apply to each column.
- Type:
dict[str, CappingValues] or None
- weights_column
weights_column argument.
- Type:
str or None
- _replacement_values
Replacement values when capping is applied. Will be a copy of capping_values.
- Type:
dict[str, CappingValues]
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- FITS = True
- check_capping_values_dict(capping_values_dict: dict[str, ~typing.Annotated[list[int | float | None], beartype.vale.Is[lambda list_arg: ...]]], dict_name: str) None[source]
Check passed dictionary.
- Parameters:
capping_values_dict (dict[str, float]) – dict of form {column_name: [lower_cap, upper_cap]}
dict_name (str) – ‘capping_values’ or ‘quantiles’
- Raises:
ValueError – if capping values are invalid, e.g. lower_cap > upper_cap
Examples
```pycon
>>> transformer = BaseCappingTransformer(
...     capping_values={"a": [10, 20], "b": [1, 3]},
... )
>>> transformer.check_capping_values_dict(transformer.capping_values, "capping_values")
```
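As a rough illustration of the kind of validation described, the sketch below checks a dict of the form {column_name: [lower_cap, upper_cap]} and raises ValueError when the lower cap exceeds the upper cap. The length check and the error messages are assumptions added for the sketch; tubular's actual checks may differ.

```python
def check_capping_values_dict(capping_values_dict: dict, dict_name: str) -> None:
    """Validate {column: [lower_cap, upper_cap]} pairs (illustrative only)."""
    for column, caps in capping_values_dict.items():
        # Assumption: each entry holds exactly a lower and an upper cap.
        if len(caps) != 2:
            raise ValueError(f"{dict_name}[{column!r}] should have length 2")
        lower_cap, upper_cap = caps
        # None means "no cap on this side", so only compare when both are set.
        if lower_cap is not None and upper_cap is not None and lower_cap > upper_cap:
            raise ValueError(f"{dict_name}[{column!r}]: lower cap exceeds upper cap")


# Valid input passes silently.
check_capping_values_dict({"a": [10, 20], "b": [None, 3]}, "capping_values")
```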
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame | None = None) BaseCappingTransformer[source]
Learn capping values from input data X.
Calculates the quantiles to cap at given the quantiles dictionary supplied when initialising the transformer. Saves learnt values in the quantile_capping_values and replacement_values attributes.
- Parameters:
X (DataFrame) – A dataframe with required columns to be capped.
y (Series or LazyFrame or None. Defaults to None) – Required for pipeline.
- Returns:
fitted instance of class
- Return type:
BaseCappingTransformer
Examples
```pycon
>>> import polars as pl
>>> transformer = BaseCappingTransformer(
...     quantiles={"a": [0.01, 0.99], "b": [0.05, 0.95]},
... )
>>> test_df = pl.DataFrame({"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]})
>>> test_target = pl.Series(name="target", values=[5, 6, 7, 8])
>>> transformer.fit(test_df, test_target)
BaseCappingTransformer(quantiles={'a': [0.01, 0.99], 'b': [0.05, 0.95]})
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[source]
Return a JSON-serializable representation of the transformer.
- Returns:
Dictionary containing all necessary attributes to recreate the transformer with from_json. Keys include 'init' (initialization parameters) and 'fit' (fitted values).
- Return type:
dict
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, return_native_override: bool | None = None) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Apply capping to columns in X.
If cap_value_max is set, any values above cap_value_max will be set to cap_value_max. If cap_value_min is set, any values below cap_value_min will be set to cap_value_min. Only works for numeric columns.
- Parameters:
X (DataFrame) – Data to apply capping to.
return_native_override (Optional[bool]) – Option to override return_native attr in transformer, useful when calling parent methods
- Returns:
X – Transformed input X with min and max capping applied to the specified columns.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> transformer = BaseCappingTransformer(
...     capping_values={"a": [10, 20], "b": [1, 3]},
... )
>>> test_df = pl.DataFrame({"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]})
>>> transformer.transform(test_df)
shape: (4, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 10  ┆ 3   ┆ 1   │
│ 15  ┆ 2   ┆ 2   │
│ 18  ┆ 3   ┆ 3   │
│ 20  ┆ 1   ┆ 4   │
└─────┴─────┴─────┘
```
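The capping rule itself (values below the lower cap become the lower cap, values above the upper cap become the upper cap) can be written out in plain Python. The helper below is a hypothetical sketch operating on lists, not the narwhals-based implementation; passing None for a cap is assumed to mean "no cap on that side".

```python
def cap_column(values: list, lower_cap=None, upper_cap=None) -> list:
    """Clip each value into [lower_cap, upper_cap]; None disables a side."""
    capped = []
    for value in values:
        if lower_cap is not None and value < lower_cap:
            capped.append(lower_cap)
        elif upper_cap is not None and value > upper_cap:
            capped.append(upper_cap)
        else:
            capped.append(value)
    return capped


# Same data and caps as the documented example for columns "a" and "b".
a_capped = cap_column([1, 15, 18, 25], lower_cap=10, upper_cap=20)
b_capped = cap_column([6, 2, 7, 1], lower_cap=1, upper_cap=3)
```

With these inputs the sketch yields the same column values as the documented output: [10, 15, 18, 20] for a and [3, 2, 3, 1] for b.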
- weighted_quantile(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, quantiles: list[int | float], values_column: str, weights_column: str) list[int | float | None][source]
Calculate weighted quantiles.
This method is adapted from the “Completely vectorized numpy solution” answer from user Alleo (https://stackoverflow.com/users/498892/alleo) to the following stackoverflow question; https://stackoverflow.com/questions/21844024/weighted-percentile-using-numpy. This method is also licenced under the CC-BY-SA terms, as the original code sample posted to stackoverflow (pre February 1, 2016) was.
Method is similar to numpy.percentile, but supports weights. Supplied quantiles should be in the range [0, 1]. Method calculates cumulative % of weight for each observation, then interpolates between these observations to calculate the desired quantiles. Null values in the observations (values) and 0 weight observations are filtered out before calculating.
- Parameters:
X (DataFrame) – Dataframe with relevant columns to calculate quantiles from.
quantiles (list[Number]) – Weighted quantiles to calculate. Must all be between 0 and 1.
values_column (str) – name of relevant values column in data
weights_column (str) – name of relevant weight column in data
- Returns:
interp_quantiles – List containing computed quantiles.
- Return type:
list[Number]
Examples
```pycon
>>> import polars as pl
>>> x = CappingTransformer(capping_values={"a": [2, 10]})
>>> df = pl.DataFrame({"a": [1, 2, 3], "weight": [1, 1, 1]})
>>> quantiles_to_compute = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
>>> computed_quantiles = x.weighted_quantile(
...     X=df, values_column="a", weights_column="weight", quantiles=quantiles_to_compute
... )
>>> [round(q, 1) for q in computed_quantiles]
[1.0, 1.0, 1.0, 1.0, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3.0]

>>> df = pl.DataFrame({"a": [1, 2, 3], "weight": [0, 1, 0]})
>>> computed_quantiles = x.weighted_quantile(
...     X=df, values_column="a", weights_column="weight", quantiles=quantiles_to_compute
... )
>>> [round(q, 1) for q in computed_quantiles]
[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]

>>> df = pl.DataFrame({"a": [1, 2, 3], "weight": [1, 1, 0]})
>>> computed_quantiles = x.weighted_quantile(
...     X=df, values_column="a", weights_column="weight", quantiles=quantiles_to_compute
... )
>>> [round(q, 1) for q in computed_quantiles]
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]

>>> df = pl.DataFrame({"a": [1, 2, 3, 4, 5], "weight": [1, 1, 1, 1, 1]})
>>> computed_quantiles = x.weighted_quantile(
...     X=df, values_column="a", weights_column="weight", quantiles=quantiles_to_compute
... )
>>> [round(q, 1) for q in computed_quantiles]
[1.0, 1.0, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]

>>> df = pl.DataFrame({"a": [1, 2, 3, 4, 5], "weight": [1, 0, 1, 0, 1]})
>>> computed_quantiles = x.weighted_quantile(
...     X=df, values_column="a", weights_column="weight", quantiles=[0, 0.5, 1.0]
... )
>>> [round(q, 1) for q in computed_quantiles]
[1.0, 2.0, 5.0]
```
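The procedure described for weighted_quantile (drop null values and zero-weight rows, compute the cumulative share of weight per observation, then interpolate linearly between observations) can be sketched in pure Python. The function below is a reconstruction from that description, written against plain lists instead of dataframes; it is an assumption about the approach, not the library's code. Quantiles below the first cumulative share are clamped to the smallest value.

```python
import bisect


def weighted_quantile(values: list, weights: list, quantiles: list) -> list:
    """Interpolate quantiles over the cumulative share of weight (sketch)."""
    # Filter null observations and zero-weight rows, then sort by value.
    pairs = sorted((v, w) for v, w in zip(values, weights) if v is not None and w > 0)
    if not pairs:
        raise ValueError("no observations with positive weight")
    total = sum(w for _, w in pairs)
    xs = [v for v, _ in pairs]
    cumulative, running = [], 0.0
    for _, w in pairs:
        running += w
        cumulative.append(running / total)

    results = []
    for q in quantiles:
        if q <= cumulative[0]:
            results.append(float(xs[0]))  # clamp at the smallest value
        elif q >= cumulative[-1]:
            results.append(float(xs[-1]))  # clamp at the largest value
        else:
            # First index whose cumulative share is >= q.
            i = bisect.bisect_left(cumulative, q)
            frac = (q - cumulative[i - 1]) / (cumulative[i] - cumulative[i - 1])
            results.append(xs[i - 1] + frac * (xs[i] - xs[i - 1]))
    return results


qs = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
equal_weights = [round(q, 1) for q in weighted_quantile([1, 2, 3], [1, 1, 1], qs)]
```

Run on the inputs from the documented examples, this sketch gives the same rounded results, which suggests (but does not prove) that it follows the same interpolation scheme.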
- class tubular.capping.CappingTransformer(capping_values: dict[str, ~typing.Annotated[list[int | float | None], beartype.vale.Is[lambda list_arg: ...]]] | None = None, quantiles: dict[str, ~typing.Annotated[list[int | float | None], beartype.vale.Is[lambda list_arg: ...]]] | None = None, weights_column: str | None = None, **kwargs: bool)[source]
Bases:
BaseCappingTransformer
Transformer to cap numeric values at both or either minimum and maximum values.
For max capping any values above the cap value will be set to the cap. Similarly for min capping any values below the cap will be set to the cap. Only works for numeric columns.
Attributes:
- capping_values: dict[str, CappingValues] or None
Capping values to apply to each column, capping_values argument.
- quantiles: dict[str, CappingValues] or None
Quantiles to set capping values at from input data. Will be empty after init, values populated when fit is run.
- quantile_capping_values: dict[str, CappingValues] or None
Capping values learned from quantiles (if provided) to apply to each column.
- weights_column: str or None
weights_column argument.
- _replacement_values: dict[str, CappingValues]
Replacement values when capping is applied. Will be a copy of capping_values.
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> import polars as pl
>>> transformer = CappingTransformer(
...     capping_values={"a": [10, 20], "b": [1, 3]},
... )
>>> test_df = pl.DataFrame({"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]})
>>> transformer.transform(test_df)
shape: (4, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 10  ┆ 3   ┆ 1   │
│ 15  ┆ 2   ┆ 2   │
│ 18  ┆ 3   ┆ 3   │
│ 20  ┆ 1   ┆ 4   │
└─────┴─────┴─────┘
>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = transformer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'CappingTransformer', 'init': {'copy': False, 'verbose': False, 'return_native': True, 'capping_values': {'a': [10, 20], 'b': [1, 3]}, 'quantiles': None, 'weights_column': None}, 'fit': {'is_fitted_': False}}
>>> CappingTransformer.from_json(json_dump)
CappingTransformer(capping_values={'a': [10, 20], 'b': [1, 3]})
```
- FITS = True
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame | None = None) CappingTransformer[source]
Learn capping values from input data X.
Calculates the quantiles to cap at given the quantiles dictionary supplied when initialising the transformer. Saves learnt values in the capping_values attribute.
- Parameters:
X (DataFrame) – A dataframe with required columns to be capped.
y (None) – Required for pipeline.
- Returns:
fitted instance of class
- Return type:
CappingTransformer
Example
```pycon
>>> import polars as pl
>>> transformer = CappingTransformer(
...     quantiles={"a": [0.01, 0.99], "b": [0.05, 0.95]},
... )
>>> test_df = pl.DataFrame({"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]})
>>> transformer.fit(test_df)
CappingTransformer(quantiles={'a': [0.01, 0.99], 'b': [0.05, 0.95]})
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- class tubular.capping.OutOfRangeNullTransformer(capping_values: dict[str, ~typing.Annotated[list[int | float | None], beartype.vale.Is[lambda list_arg: ...]]] | None = None, quantiles: dict[str, ~typing.Annotated[list[int | float | None], beartype.vale.Is[lambda list_arg: ...]]] | None = None, weights_column: str | None = None, **kwargs: bool)[source]
Bases:
BaseCappingTransformer
Transformer to set values outside of a range to null.
This transformer sets the cut off values in the same way as the CappingTransformer. So either the user can specify them directly in the capping_values argument or they can be calculated in the fit method, if the user supplies the quantiles argument.
Attributes:
- capping_values: dict[str, CappingValues] or None
Capping values to apply to each column, capping_values argument.
- quantiles: dict[str, CappingValues] or None
Quantiles to set capping values at from input data. Will be empty after init, values populated when fit is run.
- quantile_capping_values: dict[str, CappingValues] or None
Capping values learned from quantiles (if provided) to apply to each column.
- weights_column: str or None
weights_column argument.
- _replacement_values: dict[str, CappingValues]
Replacement values when capping is applied. This will contain nulls for each column.
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> import polars as pl
>>> transformer = OutOfRangeNullTransformer(
...     capping_values={"a": [10, 20], "b": [1, 3]},
... )
>>> transformer
OutOfRangeNullTransformer(capping_values={'a': [10, 20], 'b': [1, 3]})
>>> # transform method is inherited so also demo that here
>>> test_df = pl.DataFrame()
>>> test_df = pl.DataFrame({"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]})
>>> transformer.transform(test_df)
shape: (4, 3)
┌──────┬──────┬─────┐
│ a    ┆ b    ┆ c   │
│ ---  ┆ ---  ┆ --- │
│ i64  ┆ i64  ┆ i64 │
╞══════╪══════╪═════╡
│ null ┆ null ┆ 1   │
│ 15   ┆ 2    ┆ 2   │
│ 18   ┆ null ┆ 3   │
│ null ┆ 1    ┆ 4   │
└──────┴──────┴─────┘
>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = transformer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'OutOfRangeNullTransformer', 'init': {'copy': False, 'verbose': False, 'return_native': True, 'capping_values': {'a': [10, 20], 'b': [1, 3]}, 'quantiles': None, 'weights_column': None}, 'fit': {'is_fitted_': False}}
>>> OutOfRangeNullTransformer.from_json(json_dump)
OutOfRangeNullTransformer(capping_values={'a': [10, 20], 'b': [1, 3]})
```
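The difference from plain capping is that out-of-range values are replaced with null instead of the cap. A plain-Python sketch of that rule follows; the helper name is hypothetical, None stands in for null, and a cap of None is assumed to mean "no limit on that side".

```python
def null_out_of_range(values: list, lower_cap=None, upper_cap=None) -> list:
    """Replace values outside [lower_cap, upper_cap] with None (sketch)."""
    result = []
    for value in values:
        below = lower_cap is not None and value is not None and value < lower_cap
        above = upper_cap is not None and value is not None and value > upper_cap
        result.append(None if below or above else value)
    return result


# Same data and limits as the documented example for columns "a" and "b".
a_nulled = null_out_of_range([1, 15, 18, 25], lower_cap=10, upper_cap=20)
b_nulled = null_out_of_range([6, 2, 7, 1], lower_cap=1, upper_cap=3)
```

The results, [None, 15, 18, None] and [None, 2, None, 1], line up with the null pattern in the documented table.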
- FITS = True
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame | None = None) OutOfRangeNullTransformer[source]
Learn capping values from input data X.
Calculates the quantiles to cap at given the quantiles dictionary supplied when initialising the transformer. Saves learnt values in the capping_values attribute.
- Parameters:
X (DataFrame) – A dataframe with required columns to be capped.
y (None) – Required for pipeline.
- Returns:
fitted instance of class
- Return type:
OutOfRangeNullTransformer
Example
```pycon
>>> import polars as pl
>>> transformer = OutOfRangeNullTransformer(
...     quantiles={"a": [0.01, 0.99], "b": [0.05, 0.95]},
... )
>>> test_df = pl.DataFrame({"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]})
>>> transformer.fit(test_df)
OutOfRangeNullTransformer(quantiles={'a': [0.01, 0.99], 'b': [0.05, 0.95]})
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- static set_replacement_values(capping_values: dict[str, list[int | float | None]]) dict[str, list[bool | None]][source]
Set the _replacement_values to have all null values.
Keeps the existing keys in the _replacement_values dict and sets all values (except None) in the lists to np.NaN. Any None values remain in place.
- Returns:
replacement_values – replacement values for OutOfRangeNullTransformer
- Return type:
dict[str, list[bool | None]]
Examples
```pycon
>>> import polars as pl
>>> capping_values = {"a": [0.1, 0.2], "b": [None, 10]}
>>> OutOfRangeNullTransformer.set_replacement_values(capping_values)
{'a': [None, None], 'b': [False, None]}
```
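In the documented example, each supplied cap maps to None and each missing cap (None) maps to False. The stand-alone function below reproduces that input/output pair; it is inferred only from the example and the return annotation, so treat it as a guess at the mapping and not as tubular's implementation.

```python
def set_replacement_values(capping_values: dict) -> dict:
    """Map supplied caps to None and missing caps to False (inferred sketch)."""
    return {
        column: [False if cap is None else None for cap in caps]
        for column, caps in capping_values.items()
    }


replacements = set_replacement_values({"a": [0.1, 0.2], "b": [None, 10]})
```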
tubular.comparison module
Contains transformers for comparing and conditionally updating provided columns.
- class tubular.comparison.CompareTwoColumnsTransformer(columns: ]], condition: ]], **kwargs: bool | None)[source]
Bases:
BaseTransformer
Transformer to compare two columns and generate outcomes based on conditions.
This transformer evaluates a condition between two columns and generates an outcome based on the result.
- polars_compatible
Indicates whether transformer has been converted to polars/pandas agnostic narwhals framework.
- Type:
bool
- FITS
Indicates whether transform requires fit to be run first.
- Type:
bool
- jsonable
Indicates if transformer supports to/from_json methods.
- Type:
bool
- lazyframe_compatible
Indicates whether transformer works with lazyframes.
- Type:
bool
Examples
```pycon
>>> import polars as pl
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]})
>>> transformer = CompareTwoColumnsTransformer(
...     columns=["a", "b"],
...     condition=">",
... )
>>> transformed_df = transformer.transform(df)
>>> print(transformed_df)
shape: (3, 3)
┌─────┬─────┬───────┐
│ a   ┆ b   ┆ a>b   │
│ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ bool  │
╞═════╪═════╪═══════╡
│ 1   ┆ 3   ┆ false │
│ 2   ┆ 2   ┆ false │
│ 3   ┆ 1   ┆ true  │
└─────┴─────┴───────┘
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- ops_map: ClassVar[dict[ConditionEnum, Any]] = {ConditionEnum.EQUAL_TO: <built-in function eq>, ConditionEnum.GREATER_THAN: <built-in function gt>, ConditionEnum.LESS_THAN: <built-in function lt>, ConditionEnum.NOT_EQUAL_TO: <built-in function ne>}
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Serialize the transformer to a JSON-compatible dictionary.
- Returns:
JSON representation of the transformer, including init parameters.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> transformer = CompareTwoColumnsTransformer(
...     columns=["a", "b"],
...     condition=ConditionEnum.GREATER_THAN.value,
... )
>>> json_dict = transformer.to_json()
>>> from pprint import pprint
>>> pprint(json_dict, sort_dicts=True)
{'classname': 'CompareTwoColumnsTransformer',
 'fit': {'is_fitted_': True},
 'init': {'columns': ['a', 'b'],
          'condition': '>',
          'copy': False,
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform two columns based on a condition to generate an outcome.
- Parameters:
X (DataFrame) – DataFrame containing the columns to be transformed.
- Returns:
Transformed DataFrame with the new outcome column.
- Return type:
DataFrame
- Raises:
TypeError – If the columns are not of a numeric type.
Examples
```pycon
>>> import polars as pl
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]})
>>> transformer = CompareTwoColumnsTransformer(
...     columns=["a", "b"],
...     condition=">",
... )
>>> transformed_df = transformer.transform(df)
>>> print(transformed_df)
shape: (3, 3)
┌─────┬─────┬───────┐
│ a   ┆ b   ┆ a>b   │
│ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ bool  │
╞═════╪═════╪═══════╡
│ 1   ┆ 3   ┆ false │
│ 2   ┆ 2   ┆ false │
│ 3   ┆ 1   ┆ true  │
└─────┴─────┴───────┘
```
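The ops_map attribute pairs each condition with a function from Python's operator module, and the transform applies that function element-wise to the two columns. A minimal list-based sketch of the same dispatch is shown here; the OPS dict keyed by the condition strings and the compare_columns helper are illustrative names, not part of tubular.

```python
import operator

# Condition strings mapped to the operator functions listed in ops_map.
OPS = {
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    "!=": operator.ne,
}


def compare_columns(left: list, right: list, condition: str) -> list:
    """Apply the chosen comparison element-wise to two equal-length columns."""
    op = OPS[condition]
    return [op(l_value, r_value) for l_value, r_value in zip(left, right)]


# Same data as the documented example; the new column there is named "a>b".
a_gt_b = compare_columns([1, 2, 3], [3, 2, 1], ">")
```

The result, [False, False, True], matches the a>b column in the documented output.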
- class tubular.comparison.ConditionEnum(*values)[source]
Bases:
Enum
Enumeration of comparison conditions.
- EQUAL_TO = '=='
- GREATER_THAN = '>'
- LESS_THAN = '<'
- NOT_EQUAL_TO = '!='
- class tubular.comparison.EqualityChecker(**kwargs)[source]
Bases:
DropOriginalMixin, BaseTransformer
Transformer to check if two columns are equal.
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = False
- deprecated = True
- get_feature_names_out() list[str][source]
Get list of features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> # base classes just return inputs
>>> transformer = EqualityChecker(
...     columns=["a", "b"],
...     new_column_name="bla",
... )
>>> transformer.get_feature_names_out()
['bla']
```
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
- class tubular.comparison.WhenThenOtherwiseTransformer(columns: ]], when_column: str, then_column: str, **kwargs: bool | None)[source]
Bases:
BaseTransformer
Transformer to apply conditional logic across multiple columns.
This transformer evaluates specified columns against a condition and updates with given values based on the results.
- polars_compatible
Indicates whether transformer has been converted to polars/pandas agnostic narwhals framework.
- Type:
bool
- FITS
Indicates whether transform requires fit to be run first.
- Type:
bool
- jsonable
Indicates if transformer supports to/from_json methods.
- Type:
bool
- lazyframe_compatible
Indicates whether transformer works with lazyframes.
- Type:
bool
Examples
```pycon
>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3],
...         "b": [4, 5, 6],
...         "condition_col": [True, False, True],
...         "update_col": [10, 20, 30],
...     }
... )
>>> transformer = WhenThenOtherwiseTransformer(
...     columns=["a", "b"], when_column="condition_col", then_column="update_col"
... )
>>> transformed_df = transformer.transform(df)
>>> print(transformed_df)
shape: (3, 4)
┌─────┬─────┬───────────────┬────────────┐
│ a   ┆ b   ┆ condition_col ┆ update_col │
│ --- ┆ --- ┆ ---           ┆ ---        │
│ i64 ┆ i64 ┆ bool          ┆ i64        │
╞═════╪═════╪═══════════════╪════════════╡
│ 10  ┆ 10  ┆ true          ┆ 10         │
│ 2   ┆ 5   ┆ false         ┆ 20         │
│ 30  ┆ 30  ┆ true          ┆ 30         │
└─────┴─────┴───────────────┴────────────┘
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Serialize the transformer to a JSON-compatible dictionary.
- Returns:
JSON representation of the transformer, including init parameters.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> from pprint import pprint
>>> transformer = WhenThenOtherwiseTransformer(
...     columns=["a", "b"],
...     when_column="condition_col",
...     then_column="update_col",  # noqa: E501
... )
>>> pprint(transformer.to_json(), sort_dicts=True)
{'classname': 'WhenThenOtherwiseTransformer',
 'fit': {'is_fitted_': True},
 'init': {'columns': ['a', 'b'],
          'copy': False,
          'return_native': True,
          'then_column': 'update_col',
          'verbose': False,
          'when_column': 'condition_col'},
 'tubular_version': ...}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Apply conditional logic to transform specified columns.
- Parameters:
X (DataFrame) – DataFrame containing the columns to be transformed.
- Returns:
Transformed DataFrame with updated columns based on conditions.
- Return type:
DataFrame
- Raises:
TypeError – If the when_column is not of type Boolean or if columns have mismatched types.
Examples
```pycon
>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3],
...         "b": [4, 5, 6],
...         "condition_col": [True, False, True],
...         "update_col": [10, 20, 30],
...     }
... )
>>> transformer = WhenThenOtherwiseTransformer(
...     columns=["a", "b"],
...     when_column="condition_col",
...     then_column="update_col",
... )
>>> transformed_df = transformer.transform(df)
>>> print(transformed_df)
shape: (3, 4)
┌─────┬─────┬───────────────┬────────────┐
│ a   ┆ b   ┆ condition_col ┆ update_col │
│ --- ┆ --- ┆ ---           ┆ ---        │
│ i64 ┆ i64 ┆ bool          ┆ i64        │
╞═════╪═════╪═══════════════╪════════════╡
│ 10  ┆ 10  ┆ true          ┆ 10         │
│ 2   ┆ 5   ┆ false         ┆ 20         │
│ 30  ┆ 30  ┆ true          ┆ 30         │
└─────┴─────┴───────────────┴────────────┘
```
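The row-wise rule implied by the example output can be sketched in plain Python. This is only an illustration of the semantics, not tubular's implementation: the helper name `when_then_otherwise` and the dict-of-lists data layout are assumptions made for the sketch.

```python
def when_then_otherwise(data, columns, when_column, then_column):
    """Hypothetical helper: replace values in `columns` with the value from
    `then_column` on rows where `when_column` is True."""
    # The documented transform raises TypeError for a non-Boolean when_column.
    if not all(isinstance(flag, bool) for flag in data[when_column]):
        raise TypeError("when_column must contain only booleans")

    # Copy every column so the input is left untouched.
    result = {name: list(values) for name, values in data.items()}
    for col in columns:
        result[col] = [
            then if flag else current
            for current, flag, then in zip(
                data[col], data[when_column], data[then_column]
            )
        ]
    return result


data = {
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "condition_col": [True, False, True],
    "update_col": [10, 20, 30],
}
out = when_then_otherwise(data, ["a", "b"], "condition_col", "update_col")
print(out["a"])  # [10, 2, 30]
print(out["b"])  # [10, 5, 30]
```

The printed values match the `a` and `b` columns in the table above; rows where `condition_col` is false keep their original values.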
tubular.dates module
Contains transformers for working with date columns.
- class tubular.dates.BaseDatetimeTransformer(columns: list[str] | str, new_column_name: str, drop_original: bool = False, **kwargs: bool | None)[source]
Bases:
BaseGenericDateTransformer
Extends BaseTransformer for datetime scenarios.
Attributes:
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> BaseDatetimeTransformer(
...     columns=["a", "b"],
...     new_column_name="bla",
... )
BaseDatetimeTransformer(columns=['a', 'b'], new_column_name='bla')
```
- FITS = False
- jsonable = False
- lazyframe_compatible = True
- polars_compatible = True
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, return_native_override: bool | None = None) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Check types of selected columns in provided data.
- Parameters:
X (DataFrame) – Data containing self.columns
return_native_override (Optional[bool]) – option to override return_native attr in transformer, useful when calling parent methods
- Returns:
X (DataFrame) – Validated data
Examples
>>> import polars as pl
>>> import datetime
>>> transformer = BaseDatetimeTransformer(
...     columns=["a", "b"],
...     new_column_name="bla",
... )
>>> test_df = pl.DataFrame(
...     {
...         "a": [datetime.datetime(1993, 9, 27), datetime.datetime(2005, 10, 7)],
...         "b": [datetime.datetime(1991, 5, 22), datetime.datetime(2001, 12, 10)],
...     },
... )
>>> # base transform has no effect on data
>>> transformer.transform(test_df)
shape: (2, 2)
┌─────────────────────┬─────────────────────┐
│ a                   ┆ b                   │
│ ---                 ┆ ---                 │
│ datetime[μs]        ┆ datetime[μs]        │
╞═════════════════════╪═════════════════════╡
│ 1993-09-27 00:00:00 ┆ 1991-05-22 00:00:00 │
│ 2005-10-07 00:00:00 ┆ 2001-12-10 00:00:00 │
└─────────────────────┴─────────────────────┘
- class tubular.dates.BaseGenericDateTransformer(columns: list[str] | str, new_column_name: str, drop_original: bool = False, **kwargs: bool | None)[source]
Bases:
DropOriginalMixin, BaseTransformer
Extends BaseTransformer for datetime/date scenarios.
Attributes:
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- return_native: bool, default = True
Controls whether transformer returns narwhals or native pandas/polars type
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> BaseGenericDateTransformer(
...     columns=["a", "b"],
...     new_column_name="bla",
... )
BaseGenericDateTransformer(columns=['a', 'b'], new_column_name='bla')
```
- FITS = False
- check_columns_are_date_or_datetime(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, datetime_only: bool) None[source]
Check types of provided columns.
Columns must be datetime or date type, depending on the datetime_only flag. If a column does not meet the expected type criteria, a TypeError is raised.
- Parameters:
X (DataFrame) – Data to validate
datetime_only (bool) – Indicates whether ONLY datetime types are accepted
- Raises:
TypeError – if non date/datetime types are found.
TypeError – if mismatched date/datetime types are found; types should be consistent.
Examples
```pycon
>>> import polars as pl
>>> import datetime
>>> transformer = BaseGenericDateTransformer(
...     columns=["a", "b"],
...     new_column_name="bla",
... )
>>> test_df = pl.DataFrame(
...     {
...         "a": [datetime.date(1993, 9, 27), datetime.date(2005, 10, 7)],
...         "b": [datetime.date(1991, 5, 22), datetime.date(2001, 12, 10)],
...     },
... )
>>> transformer.check_columns_are_date_or_datetime(test_df, datetime_only=False)
```
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> # base classes just return inputs
>>> transformer = BaseGenericDateTransformer(
...     columns=["a", "b"],
...     new_column_name="bla",
... )
>>> transformer.get_feature_names_out()
['a', 'b']
>>> # other classes return new columns
>>> transformer = DateDifferenceTransformer(
...     columns=["a", "b"],
...     new_column_name="bla",
... )
>>> transformer.get_feature_names_out()
['bla']
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> transformer = BaseGenericDateTransformer(columns=["a", "b"], new_column_name="bla")
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'BaseGenericDateTransformer', 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'bla', 'drop_original': False}, 'fit': {'is_fitted_': True}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, datetime_only: bool = False, return_native_override: bool | None = None) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Validate data pre transform.
- Parameters:
X (DataFrame) – Data containing self.columns
datetime_only (bool) – Indicates whether ONLY datetime types are accepted
return_native_override (Optional[bool]) – option to override return_native attr in transformer, useful when calling parent methods
- Returns:
X – Validated data
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> import datetime
>>> transformer = BaseGenericDateTransformer(
...     columns=["a", "b"],
...     new_column_name="bla",
... )
>>> test_df = pl.DataFrame(
...     {
...         "a": [datetime.date(1993, 9, 27), datetime.date(2005, 10, 7)],
...         "b": [datetime.date(1991, 5, 22), datetime.date(2001, 12, 10)],
...     },
... )
>>> # base transform has no effect on data
>>> transformer.transform(test_df)
shape: (2, 2)
┌────────────┬────────────┐
│ a          ┆ b          │
│ ---        ┆ ---        │
│ date       ┆ date       │
╞════════════╪════════════╡
│ 1993-09-27 ┆ 1991-05-22 │
│ 2005-10-07 ┆ 2001-12-10 │
└────────────┴────────────┘
```
- class tubular.dates.BetweenDatesTransformer(columns: ]], new_column_name: str, drop_original: bool = False, lower_inclusive: bool = True, upper_inclusive: bool = True, **kwargs: bool)[source]
Bases:
BaseGenericDateTransformer
Transformer to generate a boolean column indicating if one date is between two others.
If any row has column_lower greater than column_upper, the output column for that row will be null instead of raising a warning.
Attributes:
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- column_lower: str
Name of the date column giving the lower bound. This attribute is not for use in any method, use 'columns' instead. Here only as a fix to allow string representation of transformer.
- column_upper: str
Name of the date column giving the upper bound. This attribute is not for use in any method, use 'columns' instead. Here only as a fix to allow string representation of transformer.
- column_between: str
Name of the column to check whether its values fall between column_lower and column_upper. This attribute is not for use in any method, use 'columns' instead. Here only as a fix to allow string representation of transformer.
- columns: list
Contains the names of the columns to compare in the order [column_lower, column_between, column_upper].
- new_column_name: str
new_column_name argument passed when initialising the transformer.
- lower_inclusive: bool
lower_inclusive argument passed when initialising the transformer.
- upper_inclusive: bool
upper_inclusive argument passed when initialising the transformer.
- drop_original: bool
indicates whether to drop original columns.
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> BetweenDatesTransformer(
...     columns=["a", "b", "c"],
...     new_column_name="b_between_a_c",
...     lower_inclusive=True,
...     upper_inclusive=True,
... )
BetweenDatesTransformer(columns=['a', 'b', 'c'],
                        new_column_name='b_between_a_c')
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> transformer = BetweenDatesTransformer(
...     columns=["a", "b", "c"],
...     new_column_name="b_between_a_c",
...     lower_inclusive=True,
...     upper_inclusive=False,
... )
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'BetweenDatesTransformer', 'init': {'columns': ['a', 'b', 'c'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'b_between_a_c', 'drop_original': False, 'lower_inclusive': True, 'upper_inclusive': False}, 'fit': {'is_fitted_': True}}
```
- transform(X: FrameT) FrameT[source]
Transform - creates column indicating if middle date is between the other two.
Rows where the lower bound is greater than the upper bound will produce null in the resulting output column for that row.
- Parameters:
X (pd/pl/nw.DataFrame) – Data to transform.
- Returns:
X (pd/pl/nw.DataFrame) – Input X with additional column (self.new_column_name) added. This column is boolean and indicates if the middle column is between the other 2.
Examples
>>> import polars as pl
>>> import datetime
>>> transformer = BetweenDatesTransformer(
...     columns=["a", "b", "c"],
...     new_column_name="b_between_a_c",
...     lower_inclusive=True,
...     upper_inclusive=True,
... )
>>> test_df = pl.DataFrame(
...     {
...         "a": [
...             datetime.date(1990, 9, 27),
...             datetime.date(2005, 10, 7),
...             datetime.date(2010, 1, 1),
...         ],
...         "b": [
...             datetime.date(1991, 5, 22),
...             datetime.date(2001, 12, 10),
...             datetime.date(2009, 1, 1),
...         ],
...         "c": [
...             datetime.date(1993, 4, 20),
...             datetime.date(2007, 11, 8),
...             datetime.date(2008, 1, 1),
...         ],
...     },
... )
>>> transformer.transform(test_df)
shape: (3, 4)
┌────────────┬────────────┬────────────┬───────────────┐
│ a          ┆ b          ┆ c          ┆ b_between_a_c │
│ ---        ┆ ---        ┆ ---        ┆ ---           │
│ date       ┆ date       ┆ date       ┆ bool          │
╞════════════╪════════════╪════════════╪═══════════════╡
│ 1990-09-27 ┆ 1991-05-22 ┆ 1993-04-20 ┆ true          │
│ 2005-10-07 ┆ 2001-12-10 ┆ 2007-11-08 ┆ false         │
│ 2010-01-01 ┆ 2009-01-01 ┆ 2008-01-01 ┆ null          │
└────────────┴────────────┴────────────┴───────────────┘
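The per-row behaviour described above, including the null result when the lower bound exceeds the upper bound, can be sketched in plain Python. The helper `is_between` is hypothetical and only mirrors the documented semantics; it is not how the transformer is implemented.

```python
import datetime


def is_between(lower, middle, upper, lower_inclusive=True, upper_inclusive=True):
    """Hypothetical helper mirroring the documented row-wise behaviour."""
    if lower > upper:
        # Documented: rows with lower bound > upper bound produce null.
        return None
    lower_ok = lower <= middle if lower_inclusive else lower < middle
    upper_ok = middle <= upper if upper_inclusive else middle < upper
    return lower_ok and upper_ok


# Same rows as the example table, in the order (a, b, c).
rows = [
    (datetime.date(1990, 9, 27), datetime.date(1991, 5, 22), datetime.date(1993, 4, 20)),
    (datetime.date(2005, 10, 7), datetime.date(2001, 12, 10), datetime.date(2007, 11, 8)),
    (datetime.date(2010, 1, 1), datetime.date(2009, 1, 1), datetime.date(2008, 1, 1)),
]
results = [is_between(a, b, c) for a, b, c in rows]
print(results)  # [True, False, None]
```

With `lower_inclusive=False` or `upper_inclusive=False`, a middle date equal to the corresponding bound would give `False` rather than `True`.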
- class tubular.dates.DateDiffLeapYearTransformer(**kwargs)[source]
Bases:
BaseGenericDateTransformer
Transformer to calculate the number of years between two dates.
Warning: this transformer is now deprecated; use DateDifferenceTransformer instead.
- columns
List of 2 columns. First column will be subtracted from second.
- Type:
List[str]
- new_column_name
Name given to calculated datediff column. If None then {column_upper}_{column_lower}_datediff will be used.
- Type:
str, default = None
- drop_original
Indicator whether to drop old columns during transform method.
- Type:
bool
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = False
- deprecated = True
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = True
- transform(X: FrameT) FrameT[source]
Calculate year gap between the two provided columns.
Creates a new column named new_column_name and optionally drops the original date columns.
- Parameters:
X (pd/pl/nw.DataFrame) – Data containing self.columns
- Returns:
X – Input data with the year-difference column added
- Return type:
pd/pl/nw.DataFrame
- class tubular.dates.DateDifferenceTransformer(columns: ]], new_column_name: str, units: ]] = 'D', drop_original: bool = False, custom_days_divider: int | None = None, **kwargs: bool)[source]
Bases:
BaseGenericDateTransformer
Transformer to calculate the difference between two date fields in specified units.
Attributes:
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> transformer = DateDifferenceTransformer(
...     columns=["a", "b"],
...     new_column_name="bla",
...     units="common_year",
... )
>>> transformer
DateDifferenceTransformer(columns=['a', 'b'], new_column_name='bla',
                          units='common_year')
>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = transformer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'DateDifferenceTransformer', 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'bla', 'drop_original': False, 'units': 'common_year', 'custom_days_divider': None}, 'fit': {'is_fitted_': True}}
>>> DateDifferenceTransformer.from_json(json_dump)
DateDifferenceTransformer(columns=['a', 'b'], new_column_name='bla',
                          units='common_year')
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> transformer = DateDifferenceTransformer(columns=["a", "b"], new_column_name="a_diff_b")
>>> # version will vary for local vs CI, so use ... as generic match
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'DateDifferenceTransformer', 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'a_diff_b', 'drop_original': False, 'units': 'D', 'custom_days_divider': None}, 'fit': {'is_fitted_': True}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Calculate the difference between the given fields in the specified units.
- Parameters:
X (DataFrame) – Data containing self.columns
- Returns:
dataframe with added date difference column
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> import datetime
>>> transformer = DateDifferenceTransformer(
...     columns=["a", "b"],
...     new_column_name="a_b_difference_years",
...     units="common_year",
... )
>>> test_df = pl.DataFrame(
...     {
...         "a": [datetime.date(1993, 9, 27), datetime.date(2005, 10, 7)],
...         "b": [datetime.date(1991, 5, 22), datetime.date(2001, 12, 10)],
...     },
... )
>>> transformer.transform(test_df)
shape: (2, 3)
┌────────────┬────────────┬──────────────────────┐
│ a          ┆ b          ┆ a_b_difference_years │
│ ---        ┆ ---        ┆ ---                  │
│ date       ┆ date       ┆ f64                  │
╞════════════╪════════════╪══════════════════════╡
│ 1993-09-27 ┆ 1991-05-22 ┆ -2.353425            │
│ 2005-10-07 ┆ 2001-12-10 ┆ -3.827397            │
└────────────┴────────────┴──────────────────────┘
```
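The values in the example are consistent with subtracting the first column from the second and dividing the day count by 365. The sketch below checks that arithmetic with the standard library only. The 365-day divisor and the helper name `common_year_difference` are assumptions inferred from the output above, not taken from the tubular source.

```python
import datetime

# Assumption inferred from the example output: a "common year" is 365 days.
COMMON_YEAR_DAYS = 365


def common_year_difference(first, second):
    """Hypothetical helper: (second - first) expressed in 365-day years."""
    return (second - first).days / COMMON_YEAR_DAYS


# Same (a, b) pairs as the example table.
pairs = [
    (datetime.date(1993, 9, 27), datetime.date(1991, 5, 22)),
    (datetime.date(2005, 10, 7), datetime.date(2001, 12, 10)),
]
differences = [round(common_year_difference(a, b), 6) for a, b in pairs]
print(differences)  # [-2.353425, -3.827397]
```

The results are negative because column `b` holds the earlier date in both rows.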
- class tubular.dates.DateDifferenceUnitsOptions(*values)[source]
Bases:
str, Enum
Options for return units in DateDifferenceTransformer.
- COMMON_YEAR = 'common_year'
- CUSTOM_DAYS = 'custom_days'
- DAYS = 'D'
- FORTNIGHT = 'fortnight'
- HOURS = 'h'
- LUNAR_MONTH = 'lunar_month'
- MINUTES = 'm'
- SECONDS = 's'
- WEEK = 'week'
- class tubular.dates.DatetimeComponentExtractor(columns: str | list[str], include: ]], **kwargs: str | bool)[source]
Bases:
BaseDatetimeTransformer
Transformer to extract numeric datetime components.
Attributes:
- columns: List[str]
List of columns for processing
- include: list of str
Which numeric datetime components to extract
- polars_compatible: bool
Indicates whether transformer has been converted to polars/pandas agnostic framework
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
- jsonable: bool
Indicates if transformer supports to/from_json methods
- FITS: bool
Indicates whether transform requires fit to be run first
Example:
```pycon
>>> transformer = DatetimeComponentExtractor(
...     columns="a",
...     include=["hour", "day"],
... )
>>> transformer
DatetimeComponentExtractor(columns=['a'], include=['hour', 'day'])
>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = transformer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'DatetimeComponentExtractor', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'dummy', 'drop_original': False, 'include': ['hour', 'day']}, 'fit': {'is_fitted_': True}}
>>> DatetimeComponentExtractor.from_json(json_dump)
DatetimeComponentExtractor(columns=['a'], include=['hour', 'day'])
```
- FITS = False
- INCLUDE_OPTIONS: ClassVar[list[str]] = ['hour', 'day', 'month', 'year']
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
List of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> transformer = DatetimeComponentExtractor(
...     columns=["a", "b"],
...     include=["hour", "day"],
... )
>>> transformer.get_feature_names_out()
['a_hour', 'a_day', 'b_hour', 'b_day']
```
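The naming pattern in this output, each column combined with each requested component, can be reproduced with a short sketch. The helper `component_feature_names` is hypothetical and only illustrates the ordering seen in the example.

```python
def component_feature_names(columns, include):
    """Hypothetical helper: '<column>_<component>' for every column, with the
    components of one column listed before moving to the next column."""
    return [f"{column}_{component}" for column in columns for component in include]


names = component_feature_names(["a", "b"], ["hour", "day"])
print(names)  # ['a_hour', 'a_day', 'b_hour', 'b_day']
```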
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, Any][source]
Convert transformer to JSON format.
- Returns:
JSON representation of the transformer
- Return type:
dict
Examples
```pycon
>>> transformer = DatetimeComponentExtractor(
...     columns="a",
...     include=["hour", "day"],
... )
>>> transformer.to_json()
{'tubular_version': '...', 'classname': 'DatetimeComponentExtractor', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'dummy', 'drop_original': False, 'include': ['hour', 'day']}, 'fit': {'is_fitted_': True}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform - Extracts numeric datetime components.
- Parameters:
X (DataFrame) – Data with columns to extract info from.
- Returns:
X – Transformed input X with added columns of extracted information.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> import datetime
>>> transformer = DatetimeComponentExtractor(
...     columns="a",
...     include=["hour", "day"],
... )
>>> test_df = pl.DataFrame(
...     {
...         "a": [
...             datetime.datetime(1993, 9, 27, 14, 30),
...             datetime.datetime(2005, 10, 7, 9, 45),
...         ],
...         "b": [
...             datetime.datetime(1991, 5, 22, 18, 0),
...             datetime.datetime(2001, 12, 10, 23, 59),
...         ],
...     },
... )
>>> transformer.transform(test_df)
shape: (2, 4)
┌─────────────────────┬─────────────────────┬────────┬───────┐
│ a                   ┆ b                   ┆ a_hour ┆ a_day │
│ ---                 ┆ ---                 ┆ ---    ┆ ---   │
│ datetime[μs]        ┆ datetime[μs]        ┆ f32    ┆ f32   │
╞═════════════════════╪═════════════════════╪════════╪═══════╡
│ 1993-09-27 14:30:00 ┆ 1991-05-22 18:00:00 ┆ 14.0   ┆ 27.0  │
│ 2005-10-07 09:45:00 ┆ 2001-12-10 23:59:00 ┆ 9.0    ┆ 7.0   │
└─────────────────────┴─────────────────────┴────────┴───────┘
```
- class tubular.dates.DatetimeComponentOptions(*values)[source]
Bases:
str, Enum
Contains options for DatetimeComponentExtractor.
- DAY = 'day'
- HOUR = 'hour'
- MONTH = 'month'
- YEAR = 'year'
- class tubular.dates.DatetimeInfoExtractor(columns: str | list[str], include: ]] | None = None, datetime_mappings: dict[~typing.Annotated[str, beartype.vale.Is[lambda s: ...]], dict[int, str]] | None = None, drop_original: bool | None = False, **kwargs: str | bool)[source]
Bases:
BaseDatetimeTransformer
Transformer to extract various features from datetime var.
Attributes:
- columns: List[str]
List of columns for processing
- include: list of str, default = ["timeofday", "timeofmonth", "timeofyear", "dayofweek"]
Which datetime categorical information to extract
- datetime_mappings: dict, default = None
Optional argument to define custom mappings for datetime values.
- drop_original: bool
indicates whether to drop provided columns post transform
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> transformer = DatetimeInfoExtractor(
...     columns="a",
...     include="timeofday",
... )
>>> transformer
DatetimeInfoExtractor(columns=['a'], datetime_mappings={},
                      include=['timeofday'])
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'DatetimeInfoExtractor', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'dummy', 'drop_original': False, 'include': ['timeofday'], 'datetime_mappings': {}}, 'fit': {'is_fitted_': True}}
```
- DATETIME_ATTR: ClassVar[dict[str, str]] = {'dayofweek': 'weekday', 'timeofday': 'hour', 'timeofmonth': 'day', 'timeofyear': 'month'}
- DEFAULT_MAPPINGS: ClassVar[dict[str, dict[int, str]]] = {'dayofweek': {1: 'monday', 2: 'tuesday', 3: 'wednesday', 4: 'thursday', 5: 'friday', 6: 'saturday', 7: 'sunday'}, 'timeofday': {0: 'night', 1: 'night', 2: 'night', 3: 'night', 4: 'night', 5: 'night', 6: 'morning', 7: 'morning', 8: 'morning', 9: 'morning', 10: 'morning', 11: 'morning', 12: 'afternoon', 13: 'afternoon', 14: 'afternoon', 15: 'afternoon', 16: 'afternoon', 17: 'afternoon', 18: 'evening', 19: 'evening', 20: 'evening', 21: 'evening', 22: 'evening', 23: 'evening'}, 'timeofmonth': {1: 'start', 2: 'start', 3: 'start', 4: 'start', 5: 'start', 6: 'start', 7: 'start', 8: 'start', 9: 'start', 10: 'start', 11: 'middle', 12: 'middle', 13: 'middle', 14: 'middle', 15: 'middle', 16: 'middle', 17: 'middle', 18: 'middle', 19: 'middle', 20: 'middle', 21: 'end', 22: 'end', 23: 'end', 24: 'end', 25: 'end', 26: 'end', 27: 'end', 28: 'end', 29: 'end', 30: 'end', 31: 'end'}, 'timeofyear': {1: 'winter', 2: 'winter', 3: 'spring', 4: 'spring', 5: 'spring', 6: 'summer', 7: 'summer', 8: 'summer', 9: 'autumn', 10: 'autumn', 11: 'autumn', 12: 'winter'}}
- FITS = False
- INCLUDE_OPTIONS: ClassVar[list[str]] = ['timeofday', 'timeofmonth', 'timeofyear', 'dayofweek']
- RANGE_TO_MAP: ClassVar[dict[str, set[int]]] = {'dayofweek': {1, 2, 3, 4, 5, 6, 7}, 'timeofday': {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23}, 'timeofmonth': {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}, 'timeofyear': {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}}
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> transformer = DatetimeInfoExtractor(
...     columns=["a", "b"],
...     include=["timeofday", "timeofmonth"],
... )
>>> transformer.get_feature_names_out()
['a_timeofday', 'a_timeofmonth', 'b_timeofday', 'b_timeofmonth']
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
>>> transformer = DatetimeInfoExtractor(columns="a")
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'DatetimeInfoExtractor', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'dummy', 'drop_original': False, 'include': ['timeofday', 'timeofmonth', 'timeofyear', 'dayofweek'], 'datetime_mappings': {}}, 'fit': {'is_fitted_': True}}
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform - Extracts new features from datetime variables.
- Parameters:
X (DataFrame) – Data with columns to extract info from.
- Returns:
X (DataFrame) – Transformed input X with added columns of extracted information.
Examples
>>> import polars as pl
>>> import datetime
>>> transformer = DatetimeInfoExtractor(
...     columns="a",
...     include="timeofmonth",
... )
>>> test_df = pl.DataFrame(
...     {
...         "a": [datetime.datetime(1993, 9, 27), datetime.datetime(2005, 10, 7)],
...         "b": [datetime.datetime(1991, 5, 22), datetime.datetime(2001, 12, 10)],
...     },
... )
>>> transformer.transform(test_df)
shape: (2, 3)
┌─────────────────────┬─────────────────────┬───────────────┐
│ a                   ┆ b                   ┆ a_timeofmonth │
│ ---                 ┆ ---                 ┆ ---           │
│ datetime[μs]        ┆ datetime[μs]        ┆ enum          │
╞═════════════════════╪═════════════════════╪═══════════════╡
│ 1993-09-27 00:00:00 ┆ 1991-05-22 00:00:00 ┆ end           │
│ 2005-10-07 00:00:00 ┆ 2001-12-10 00:00:00 ┆ start         │
└─────────────────────┴─────────────────────┴───────────────┘
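The labels in this output follow the `timeofmonth` entry of DEFAULT_MAPPINGS listed above (days 1 to 10 are "start", 11 to 20 "middle", 21 to 31 "end"). A minimal standard-library sketch of that lookup is shown here. The helper `time_of_month` is hypothetical and is not the transformer's implementation.

```python
import datetime

# Rebuilds the documented default 'timeofmonth' mapping from day-of-month to label.
TIME_OF_MONTH = {
    **{day: "start" for day in range(1, 11)},
    **{day: "middle" for day in range(11, 21)},
    **{day: "end" for day in range(21, 32)},
}


def time_of_month(value, mapping=TIME_OF_MONTH):
    """Hypothetical helper: look up the label for a datetime's day of month."""
    return mapping[value.day]


labels = [
    time_of_month(d)
    for d in (datetime.datetime(1993, 9, 27), datetime.datetime(2005, 10, 7))
]
print(labels)  # ['end', 'start']
```

Passing a different `mapping` plays the role that the `datetime_mappings` argument has in the transformer, under the same assumption that keys are day-of-month integers.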
- class tubular.dates.DatetimeInfoOptions(*values)[source]
Bases:
str, Enum
Options for what is returned by DatetimeInfoExtractor.
- DAY_OF_WEEK = 'dayofweek'
- TIME_OF_DAY = 'timeofday'
- TIME_OF_MONTH = 'timeofmonth'
- TIME_OF_YEAR = 'timeofyear'
- class tubular.dates.DatetimeSinusoidCalculator(columns: str | list[str], method: ]], units: ]]], period: ]]] = 6.283185307179586, drop_original: bool = False, **kwargs: bool | str)[source]
Bases:
BaseDatetimeTransformer
Calculate the sine or cosine of a datetime column in a given unit (e.g. hour).
Includes the option to scale period of the sine or cosine to match the natural period of the unit (e.g. 24).
Attributes:
- columns: str or list
Columns to take the sine or cosine of.
- method: str or list
The function to be calculated; either sin, cos or a list containing both.
- units: str or dict
Which time unit the calculation is to be carried out on. Will take any of 'year', 'month', 'day', 'hour', 'minute', 'second', 'microsecond'. Can be a string or a dict containing key-value pairs of column name and units to be used for that column.
- period: str, float or dict, default = 2*np.pi
The period of the output in the units specified above. Can be a single value or a dict containing key-value pairs of column name and period to be used for that column.
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> DatetimeSinusoidCalculator(
...     columns="a",
...     method="sin",
...     units="month",
... )
DatetimeSinusoidCalculator(columns=['a'], method=['sin'], units='month')
```
- FITS = False
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> transformer = DatetimeSinusoidCalculator(
...     columns="a",
...     method="sin",
...     units="month",
... )
>>> transformer.get_feature_names_out()
['sin_6.283185307179586_month_a']
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> transformer = DatetimeSinusoidCalculator(
...     columns="a",
...     method="sin",
...     units="month",
... )
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'DatetimeSinusoidCalculator', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'dummy', 'drop_original': False, 'method': ['sin'], 'units': 'month', 'period': 6.283185307179586}, 'fit': {'is_fitted_': True}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, return_native_override: bool | None = None) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform - creates column containing sine or cosine of another datetime column.
Which function is used is stored in the self.method attribute.
- Parameters:
X (pd/pl/nw.DataFrame) – Data to transform.
return_native_override (Optional[bool]) – Option to override return_native attr in transformer, useful when calling parent methods
- Returns:
X (pd/pl/nw.DataFrame) – Input X with additional columns added. As the example shows, these are named "<method>_<period>_<units>_<original_column>".
Example
>>> import polars as pl
>>> import datetime
>>> transformer = DatetimeSinusoidCalculator(
...     columns="a",
...     method="sin",
...     units="month",
... )
>>> test_df = pl.DataFrame(
...     {
...         "a": [datetime.datetime(1993, 9, 27), datetime.datetime(2005, 10, 7)],
...         "b": [datetime.datetime(1991, 5, 22), datetime.datetime(2001, 12, 10)],
...     },
... )
>>> transformer.transform(test_df)
shape: (2, 3)
┌─────────────────────┬─────────────────────┬───────────────────────────────┐
│ a                   ┆ b                   ┆ sin_6.283185307179586_month_a │
│ ---                 ┆ ---                 ┆ ---                           │
│ datetime[μs]        ┆ datetime[μs]        ┆ f64                           │
╞═════════════════════╪═════════════════════╪═══════════════════════════════╡
│ 1993-09-27 00:00:00 ┆ 1991-05-22 00:00:00 ┆ 0.412118                      │
│ 2005-10-07 00:00:00 ┆ 2001-12-10 00:00:00 ┆ -0.544021                     │
└─────────────────────┴─────────────────────┴───────────────────────────────┘
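The values in the example are consistent with taking the month number and evaluating `sin(value * 2π / period)`, where the period is 2π. A standard-library sketch of that calculation follows. The formula is an assumption inferred from the example output, and `sinusoid_feature` is a hypothetical helper rather than tubular code:

```python
import datetime
import math


def sinusoid_feature(
    value: datetime.datetime,
    units: str = "month",
    method: str = "sin",
    period: float = 2 * math.pi,  # assumed default period
) -> float:
    """Apply sin or cos to one datetime component, scaled by 2*pi / period."""
    component = getattr(value, units)  # e.g. value.month
    func = {"sin": math.sin, "cos": math.cos}[method]
    return func(component * (2 * math.pi / period))


dates = [datetime.datetime(1993, 9, 27), datetime.datetime(2005, 10, 7)]
values = [sinusoid_feature(d) for d in dates]
print([round(v, 6) for v in values])
```

With a period of 12 and `units="month"`, the same function would map the calendar year onto one full cycle, which is the usual reason for choosing a sinusoidal encoding.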
- class tubular.dates.DatetimeSinusoidUnitsOptions(*values)[source]
Bases:
str, Enum
Options for units argument of DatetimeSinusoidCalculator.
- DAY = 'day'
- HOUR = 'hour'
- MICROSECOND = 'microsecond'
- MINUTE = 'minute'
- MONTH = 'month'
- SECOND = 'second'
- YEAR = 'year'
- class tubular.dates.MethodOptions(*values)[source]
Bases:
str, Enum
Options for method arg of DatetimeSinusoidCalculator.
- COS = 'cos'
- SIN = 'sin'
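Because both option classes derive from `str` and `Enum`, a plain string can be validated by constructing the enum from it. The sketch below illustrates the pattern with a hypothetical `UnitsOptions` class and `validate_units` helper. It mirrors the listed values but is not the tubular implementation:

```python
from enum import Enum


class UnitsOptions(str, Enum):
    """Hypothetical mirror of a str/Enum options class."""

    YEAR = "year"
    MONTH = "month"
    DAY = "day"


def validate_units(units: str) -> str:
    """Return the validated value, or raise ValueError listing the allowed options."""
    try:
        return UnitsOptions(units).value
    except ValueError:
        allowed = sorted(option.value for option in UnitsOptions)
        raise ValueError(f"units must be one of {allowed}, got {units!r}") from None


print(validate_units("month"))
# Members compare equal to their string values because of the str base class.
print(UnitsOptions.MONTH == "month")
```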
- class tubular.dates.SeriesDtMethodTransformer(**kwargs)[source]
Bases:
BaseDatetimeTransformer
Transformer that applies a pandas.Series.dt method.
Transformer assigns the output of the method to a new column. It is possible to supply other keyword arguments to the transform method, which will be passed to the pandas.Series.dt method being called.
Be aware it is possible to supply incompatible arguments to init that will only be identified when transform is run. This is because there are many combinations of method, input and output sizes. Additionally, some methods may only work as expected when called in transform with specific keyword arguments.
- column
Name of column to apply transformer to. This attribute is not for use in any method; use columns instead. It is here only as a fix to allow string representation of the transformer.
- Type:
str
- columns
Column name for transformation.
- Type:
str
- new_column_name
The name of the column or columns to be assigned to the output of running the pandas method in transform.
- Type:
str
- pd_method_name
The name of the pandas.Series.dt method to call.
- Type:
str
- pd_method_kwargs
Dictionary of keyword arguments to call the pd.Series.dt method with.
- Type:
dict
- drop_original
Indicates whether to drop self.column post transform
- Type:
bool
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = False
- deprecated = True
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
- transform(X: DataFrame) DataFrame[source]
Transform specific column on input pandas.DataFrame (X) using the given pandas.Series.dt method.
Any keyword arguments set in the pd_method_kwargs attribute are passed onto the pd.Series.dt method when calling it.
- Parameters:
X (pd.DataFrame) – Data to transform.
- Returns:
X – Input X with additional column (self.new_column_name) added. These contain the output of running the pd.Series.dt method.
- Return type:
pd.DataFrame
- class tubular.dates.ToDatetimeTransformer(columns: str | list[str], time_format: str | None = None, **kwargs: bool)[source]
Bases:
BaseTransformer
Class to convert specified columns to datetime.
Class simply uses the pd.to_datetime method on the specified columns.
Attributes:
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> transformer = ToDatetimeTransformer(
...     columns="a",
...     time_format="%d/%m/%Y",
... )
>>> transformer
ToDatetimeTransformer(columns=['a'], time_format='%d/%m/%Y')

>>> # version will vary for local vs CI, so use ... as generic match
>>> json_dump = transformer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'ToDatetimeTransformer', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'time_format': '%d/%m/%Y'}, 'fit': {'is_fitted_': True}}
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> transformer = ToDatetimeTransformer(columns="a", time_format="%d/%m/%Y")

>>> # version will vary for local vs CI, so use ... as generic match
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'ToDatetimeTransformer', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'time_format': '%d/%m/%Y'}, 'fit': {'is_fitted_': True}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Convert specified column to datetime using pd.to_datetime.
- Parameters:
X (DataFrame) – Data with column to transform.
- Returns:
dataframe with provided columns converted to datetime
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl

>>> transformer = ToDatetimeTransformer(
...     columns="a",
...     time_format="%d/%m/%Y",
... )

>>> test_df = pl.DataFrame({"a": ["01/02/2020", "10/12/1996"], "b": [1, 2]})

>>> transformer.transform(test_df)
shape: (2, 2)
┌─────────────────────┬─────┐
│ a                   ┆ b   │
│ ---                 ┆ --- │
│ datetime[μs]        ┆ i64 │
╞═════════════════════╪═════╡
│ 2020-02-01 00:00:00 ┆ 1   │
│ 1996-12-10 00:00:00 ┆ 2   │
└─────────────────────┴─────┘
```
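The `time_format` string uses the familiar strftime-style directives, so the conversion in the example can be illustrated with the standard library alone. This is only a sketch of the parsing step; `to_datetime_column` is a hypothetical helper, and passing nulls through unchanged is an assumption:

```python
from datetime import datetime


def to_datetime_column(values, time_format):
    """Parse each string with the given format, leaving None entries untouched."""
    return [
        None if value is None else datetime.strptime(value, time_format)
        for value in values
    ]


converted = to_datetime_column(["01/02/2020", "10/12/1996"], "%d/%m/%Y")
print(converted)
```

Note that `%d/%m/%Y` reads the day first, which is why "01/02/2020" becomes 1 February 2020 rather than 2 January.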
tubular.imputers module
Contains transformers that deal with imputation of missing values.
- class tubular.imputers.ArbitraryImputer(impute_value: int | float | str | bool, columns: str | list[str], **kwargs: bool | None)[source]
Bases:
BaseImputer
Transformer to impute null values with an arbitrary pre-defined value.
- impute_value
Value to impute nulls with.
- Type:
int or float or str or bool
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> arbitrary_imputer = ArbitraryImputer(columns=["a", "b"], impute_value=5)
>>> arbitrary_imputer
ArbitraryImputer(columns=['a', 'b'], impute_value=5)

>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = arbitrary_imputer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'ArbitraryImputer', 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False, 'return_native': True, 'impute_value': 5}, 'fit': {'is_fitted_': True, 'impute_values_': {'a': 5, 'b': 5}}}

>>> ArbitraryImputer.from_json(json_dump)
ArbitraryImputer(columns=['a', 'b'], impute_value=5)
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Impute missing values with the supplied impute_value.
- Parameters:
X (DataFrame) – Data containing columns to impute.
- Returns:
X (DataFrame) – Transformed input X with nulls imputed with the specified impute_value, for the specified columns.
Example
>>> import polars as pl
>>> test_df = pl.DataFrame({"a": [1, None, 2], "b": [3, None, 4]})
>>> imputer = ArbitraryImputer(columns=["a", "b"], impute_value=5)
>>> imputer.transform(test_df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 5   ┆ 5   │
│ 2   ┆ 4   │
└─────┴─────┘
- class tubular.imputers.BaseImputer(columns: ]] | str, copy: bool = False, verbose: bool = False, return_native: bool = True)[source]
Bases:
BaseTransformer
Contains transform method that fills nulls with values from self.impute_values_.
Other imputers in this module should inherit from this class.
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> BaseImputer(columns=["a", "b"])
BaseImputer(columns=['a', 'b'])
```
- FITS = False
- jsonable = False
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
- Raises:
RuntimeError – if class is not jsonable
Examples
```pycon
>>> import polars as pl

>>> arbitrary_imputer = ArbitraryImputer(columns=["a", "b"], impute_value=1)

>>> # version will vary for local vs CI, so use ... as generic match
>>> arbitrary_imputer.to_json()
{'tubular_version': ..., 'classname': 'ArbitraryImputer', 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False, 'return_native': True, 'impute_value': 1}, 'fit': {'is_fitted_': True, 'impute_values_': {'a': 1, 'b': 1}}}

>>> mean_imputer = MeanImputer(columns=["a", "b"])

>>> test_df = pl.DataFrame({"a": [1, None], "b": [None, 2]})

>>> _ = mean_imputer.fit(test_df)

>>> mean_imputer.to_json()
{'tubular_version': ..., 'classname': 'MeanImputer', 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False, 'return_native': True, 'weights_column': None}, 'fit': {'is_fitted_': True, 'impute_values_': {'a': 1.0, 'b': 2.0}}}
```
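The dumps above share a nested layout: a class name, an `init` level holding constructor arguments, and a `fit` level holding learned state. The sketch below shows how such a round trip can work in principle. `SketchImputer` is a hypothetical stand-in written for illustration, it omits the `tubular_version` key, and nothing here should be read as tubular's actual implementation:

```python
import json


class SketchImputer:
    """Hypothetical stand-in used only to illustrate the dump/restore pattern."""

    jsonable = True  # class-level flag, mirroring the documented attribute

    def __init__(self, columns, impute_value):
        self.columns = [columns] if isinstance(columns, str) else list(columns)
        self.impute_value = impute_value

    def to_json(self):
        """Return a nested dict with init-level and fit-level attributes."""
        if not self.jsonable:
            raise RuntimeError(f"{type(self).__name__} is not jsonable")
        return {
            "classname": type(self).__name__,
            "init": {"columns": self.columns, "impute_value": self.impute_value},
            "fit": {"impute_values_": {c: self.impute_value for c in self.columns}},
        }

    @classmethod
    def from_json(cls, json_dict):
        """Rebuild an instance from the init level of a dump."""
        return cls(**json_dict["init"])


original = SketchImputer(columns=["a", "b"], impute_value=1)
payload = json.dumps(original.to_json())  # plain JSON text, safe to store
restored = SketchImputer.from_json(json.loads(payload))
print(restored.to_json() == original.to_json())
```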
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, return_native_override: bool | None = None) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Impute missing values with values calculated from fit method.
- Parameters:
X (DataFrame) – Data to impute.
return_native_override (Optional[bool]) – option to override return_native attr in transformer, useful when calling parent methods
- Returns:
X – Transformed input X with nulls imputed with the values stored in self.impute_values_ for the specified columns.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl

>>> imputer = BaseImputer(columns=["a", "b"])

>>> imputer.impute_values_ = {"a": 2, "b": 3.5}

>>> test_df = pl.DataFrame({"a": [1, None, 2], "b": [3, None, 4]})

>>> imputer.transform(test_df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 3.0 │
│ 2   ┆ 3.5 │
│ 2   ┆ 4.0 │
└─────┴─────┘
```
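Conceptually, the transform replaces each null with the value stored for its column in `impute_values_`. A dependency-free sketch of that step, using a dict of lists in place of a dataframe; `impute_nulls` is a hypothetical helper, and unlike the polars example it does not cast column dtypes:

```python
def impute_nulls(data, impute_values):
    """Fill None entries column by column using a {column: value} lookup."""
    return {
        column: [
            impute_values[column] if value is None and column in impute_values else value
            for value in values
        ]
        for column, values in data.items()
    }


test_data = {"a": [1, None, 2], "b": [3, None, 4]}
filled = impute_nulls(test_data, {"a": 2, "b": 3.5})
print(filled)
```

An ArbitraryImputer-style fill is then the special case where every column maps to the same value.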
- class tubular.imputers.MeanImputer(columns: str | list[str], weights_column: str | None = None, **kwargs: bool)[source]
Bases:
WeightColumnMixin, BaseImputer
Transformer to impute missing values with the mean of the supplied columns.
- impute_values_
Created during fit method. Dictionary of float / int (mean) values of columns in the columns attribute. Keys of impute_values_ give the column names.
- Type:
dict
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> import polars as pl

>>> mean_imputer = MeanImputer(
...     columns=["a", "b"],
... )
>>> mean_imputer
MeanImputer(columns=['a', 'b'])

>>> # once fit, transformer can also be dumped to json and reinitialised

>>> test_df = pl.DataFrame({"a": [0, None], "b": [None, 1]})

>>> _ = mean_imputer.fit(test_df)

>>> json_dump = mean_imputer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'MeanImputer', 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False, 'return_native': True, 'weights_column': None}, 'fit': {'is_fitted_': True, 'impute_values_': {'a': 0.0, 'b': 1.0}}}

>>> MeanImputer.from_json(json_dump)
MeanImputer(columns=['a', 'b'])
```
- FITS = True
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame | None = None) MeanImputer[source]
Calculate mean values to impute with from X.
- Parameters:
X (DataFrame) – Data to “learn” the mean values from.
y (Series or LazyFrame or None, default = None) – Not required.
- Returns:
fitted class instance.
- Return type:
MeanImputer
Examples
```pycon
>>> import polars as pl
>>> test_df = pl.DataFrame({"a": [1, None, 2], "b": [3, None, 4]})
>>> imputer = MeanImputer(columns=["a", "b"])
>>> imputer = imputer.fit(test_df)
>>> imputer.transform(test_df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 1.0 ┆ 3.0 │
│ 1.5 ┆ 3.5 │
│ 2.0 ┆ 4.0 │
└─────┴─────┘
```
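The fit step learns one mean per column, ignoring nulls, and the constructor accepts an optional `weights_column`. A rough sketch of that calculation on a dict of lists follows. `fit_means` is a hypothetical helper, and the weighted form used here (sum of value times weight over sum of weights, across non-null rows) is an assumption about how weights enter the mean:

```python
def fit_means(data, columns, weights_column=None):
    """Compute a (optionally weighted) mean per column, skipping None values."""
    means = {}
    for column in columns:
        values = data[column]
        weights = data[weights_column] if weights_column else [1.0] * len(values)
        # Keep only rows where the value is present.
        pairs = [(v, w) for v, w in zip(values, weights) if v is not None]
        total_weight = sum(w for _, w in pairs)
        means[column] = sum(v * w for v, w in pairs) / total_weight
    return means


test_data = {"a": [1, None, 2], "b": [3, None, 4], "w": [1, 1, 3]}
unweighted = fit_means(test_data, ["a", "b"])
weighted = fit_means(test_data, ["a", "b"], weights_column="w")
print(unweighted, weighted)
```

The sketch does not guard against a column that is entirely null, which would divide by zero here.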
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- class tubular.imputers.MedianImputer(columns: str | list[str], weights_column: str | None = None, **kwargs: bool)[source]
Bases:
BaseImputer, WeightColumnMixin
Transformer to impute missing values with the median of the supplied columns.
- impute_values_
Created during fit method. Dictionary of float / int (median) values of columns in the columns attribute. Keys of impute_values_ give the column names.
- Type:
dict
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> import polars as pl

>>> median_imputer = MedianImputer(
...     columns=["a", "b"],
... )
>>> median_imputer
MedianImputer(columns=['a', 'b'])

>>> # once fit, transformer can also be dumped to json and reinitialised

>>> test_df = pl.DataFrame({"a": [0, None], "b": [None, 1]})

>>> _ = median_imputer.fit(test_df)

>>> json_dump = median_imputer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'MedianImputer', 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False, 'return_native': True, 'weights_column': None}, 'fit': {'is_fitted_': True, 'impute_values_': {'a': 0.0, 'b': 1.0}}}

>>> MedianImputer.from_json(json_dump)
MedianImputer(columns=['a', 'b'])
```
- FITS = True
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame | None = None) MedianImputer[source]
Calculate median values to impute with from X.
- Parameters:
X (DataFrame) – Data to “learn” the median values from.
y (Series or LazyFrame or None, default = None) – Not required.
- Returns:
fitted class instance.
- Return type:
MedianImputer
Examples
```pycon
>>> import polars as pl
>>> test_df = pl.DataFrame({"a": [1, None, 2], "b": [3, None, 4]})
>>> imputer = MedianImputer(columns=["a", "b"])
>>> imputer = imputer.fit(test_df)
>>> imputer.transform(test_df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 1.0 ┆ 3.0 │
│ 1.5 ┆ 3.5 │
│ 2.0 ┆ 4.0 │
└─────┴─────┘
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- class tubular.imputers.ModeImputer(columns: str | list[str], weights_column: str | None = None, **kwargs: bool)[source]
Bases:
BaseImputer, WeightColumnMixin
Transformer to impute missing values with the mode of the supplied columns.
If mode is NaN, a warning will be raised.
- impute_values_
Created during fit method. Dictionary of float / int (mode) values of columns in the columns attribute. Keys of impute_values_ give the column names.
- Type:
dict
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> import polars as pl

>>> mode_imputer = ModeImputer(
...     columns=["a", "b"],
... )
>>> mode_imputer
ModeImputer(columns=['a', 'b'])

>>> # once fit, transformer can also be dumped to json and reinitialised

>>> test_df = pl.DataFrame({"a": [0, None], "b": [None, 1]})

>>> _ = mode_imputer.fit(test_df)

>>> json_dump = mode_imputer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'ModeImputer', 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False, 'return_native': True, 'weights_column': None}, 'fit': {'is_fitted_': True, 'impute_values_': {'a': 0, 'b': 1}}}

>>> ModeImputer.from_json(json_dump)
ModeImputer(columns=['a', 'b'])
```
- FITS = True
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame | None = None) ModeImputer[source]
Calculate mode values to impute with from X.
In the event of a tie, the highest modal value will be returned.
- Parameters:
X (DataFrame) – Data to “learn” the mode values from.
y (Series or LazyFrame or None, default = None) – Not required.
- Returns:
fitted class instance
- Return type:
ModeImputer
Examples
```pycon
>>> import polars as pl
>>> test_df = pl.DataFrame({"a": [1, None, 2], "b": [3, None, 4]})
>>> imputer = ModeImputer(columns=["a", "b"])
>>> imputer = imputer.fit(test_df)
>>> imputer.transform(test_df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
│ 2   ┆ 4   │
└─────┴─────┘
```
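In the example each non-null value occurs once, so every value ties for the mode and the highest one (2 for column a, 4 for column b) is chosen. The tie-break rule can be sketched with `collections.Counter`. `fit_mode` is a hypothetical helper, it ignores weights, and returning None for an all-null column is an assumption:

```python
from collections import Counter


def fit_mode(values):
    """Return the most frequent non-null value, preferring the highest on ties."""
    counts = Counter(value for value in values if value is not None)
    if not counts:
        return None  # assumed behaviour when there is nothing to count
    top_count = max(counts.values())
    return max(value for value, count in counts.items() if count == top_count)


modes = {"a": fit_mode([1, None, 2]), "b": fit_mode([3, None, 4])}
print(modes)
```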
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- class tubular.imputers.NearestMeanResponseImputer(**kwargs)[source]
Bases:
BaseImputer
Impute nulls with the value where the average target is most similar to that for the nulls.
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = True
- deprecated = True
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series) NearestMeanResponseImputer[source]
Calculate mean values to impute with.
- Parameters:
X (FrameT) – Data to fit the transformer on.
y (nw.Series) – Response column used to determine the value to impute with. The average response for each level of every column is calculated. The level which has the closest average response to the average response of the unknown levels is selected as the imputation value.
- Returns:
fitted class instance
- Return type:
NearestMeanResponseImputer
- Raises:
ValueError – if provided y contains nulls.
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = True
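The fit description for NearestMeanResponseImputer can be sketched for a single column as follows: average the response within each level, average it over the null rows, and pick the level whose average is closest. `nearest_mean_response_value` is a hypothetical helper. Returning None when the column has no nulls, and resolving ties by first occurrence, are assumptions:

```python
def nearest_mean_response_value(values, response):
    """Pick the level whose mean response is closest to the mean response of null rows."""
    if any(y is None for y in response):
        raise ValueError("y has nulls")

    null_responses = [y for v, y in zip(values, response) if v is None]
    if not null_responses:
        return None  # assumed: nothing to learn when the column has no nulls
    null_mean = sum(null_responses) / len(null_responses)

    # Group the response by each non-null level of the column.
    by_level = {}
    for v, y in zip(values, response):
        if v is not None:
            by_level.setdefault(v, []).append(y)
    level_means = {level: sum(ys) / len(ys) for level, ys in by_level.items()}

    return min(level_means, key=lambda level: abs(level_means[level] - null_mean))


column = [1, 1, 2, 2, None]
target = [0.0, 0.2, 0.9, 1.0, 0.8]
impute_value = nearest_mean_response_value(column, target)
print(impute_value)
```

Here level 1 averages 0.1 and level 2 averages 0.95, so the null rows (average 0.8) are closest to level 2.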
- class tubular.imputers.NullIndicator(columns: ]] | str, **kwargs: bool | None)[source]
Bases:
BaseTransformer
Class to create a binary indicator column for null values.
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> null_indicator = NullIndicator(
...     columns=["a", "b"],
... )
>>> null_indicator
NullIndicator(columns=['a', 'b'])

>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = null_indicator.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'NullIndicator', 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False, 'return_native': True}, 'fit': {'is_fitted_': True}}

>>> NullIndicator.from_json(json_dump)
NullIndicator(columns=['a', 'b'])
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Create new columns indicating the position of null values for each variable in self.columns.
- Parameters:
X (DataFrame) – Data to add indicators to.
- Returns:
dataframe with null indicator columns added
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> test_df = pl.DataFrame({"a": [1, None, 2], "b": [3, None, 4]})
>>> imputer = NullIndicator(columns=["a", "b"])
>>> imputer.transform(test_df)
shape: (3, 4)
┌──────┬──────┬─────────┬─────────┐
│ a    ┆ b    ┆ a_nulls ┆ b_nulls │
│ ---  ┆ ---  ┆ ---     ┆ ---     │
│ i64  ┆ i64  ┆ bool    ┆ bool    │
╞══════╪══════╪═════════╪═════════╡
│ 1    ┆ 3    ┆ false   ┆ false   │
│ null ┆ null ┆ true    ┆ true    │
│ 2    ┆ 4    ┆ false   ┆ false   │
└──────┴──────┴─────────┴─────────┘
```
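The example output shows one boolean `<column>_nulls` column appended per input column, with the originals left in place. A dependency-free sketch of that behaviour on a dict of lists; `add_null_indicators` is a hypothetical helper:

```python
def add_null_indicators(data, columns):
    """Append a boolean '<column>_nulls' list for each requested column."""
    result = dict(data)  # shallow copy so the input mapping is not modified
    for column in columns:
        result[f"{column}_nulls"] = [value is None for value in data[column]]
    return result


test_data = {"a": [1, None, 2], "b": [3, None, 4]}
with_indicators = add_null_indicators(test_data, ["a", "b"])
print(with_indicators)
```

Keeping the original columns means an indicator can be added before an imputer runs, so the information about which rows were missing survives the imputation.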
tubular.mapping module
Contains transformers that apply different types of mappings to columns.
- class tubular.mapping.BaseCrossColumnMappingTransformer(**kwargs)[source]
Bases:
BaseMappingTransformer
BaseMappingTransformer Extension for cross column mapping transformers.
- adjust_column
Column containing the values to be adjusted.
- Type:
str
- mappings
Dictionary of mappings for each column individually to be applied to the adjust_column. The dict passed to mappings in init is set to the mappings attribute.
- Type:
dict
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = False
- deprecated = True
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
- transform(X: DataFrame) DataFrame[source]
Check X is valid for transform and calls parent transform.
- Parameters:
X (pd.DataFrame) – Data to apply adjustments to.
- Returns:
X – Transformed data X with adjustments applied to specified columns.
- Return type:
pd.DataFrame
- Raises:
ValueError – if provided adjust_column is not in DataFrame.
- class tubular.mapping.BaseCrossColumnNumericTransformer(**kwargs)[source]
Bases:
BaseCrossColumnMappingTransformer
BaseCrossColumnNumericTransformer Extension for cross column numerical mapping transformers.
- adjust_column
Column containing the values to be adjusted.
- Type:
str
- mappings
Dictionary of mappings for each column individually to be applied to the adjust_column. The dict passed to mappings in init is set to the mappings attribute.
- Type:
dict
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = False
- deprecated = True
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
- transform(X: DataFrame) DataFrame[source]
Check X is valid for transform and calls parent transform.
- Parameters:
X (pd.DataFrame) – Data to apply adjustments to.
- Returns:
X – Transformed data X with adjustments applied to specified columns.
- Return type:
pd.DataFrame
- Raises:
TypeError – if provided columns are non-numeric
- class tubular.mapping.BaseMappingTransformMixin(columns: ]] | str, copy: bool = False, verbose: bool = False, return_native: bool = True)[source]
Bases:
BaseTransformer
Mixin class to apply mappings to columns method.
Transformer uses the mappings attribute which should be a dict of dicts/mappings for each required column.
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- FITS = False
- jsonable = False
- lazyframe_compatible = True
- polars_compatible = True
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, return_native_override: bool | None = None) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Apply mapping defined in the mappings dict to each column in the columns attribute.
- Parameters:
X (DataFrame) – Data with nominal columns to transform.
return_native_override (Optional[bool]) – option to override return_native attr in transformer, useful when calling parent methods
- Returns:
X (DataFrame) – Transformed input X with levels mapped according to mappings dict.
No doctest is currently included for this method, as the class is not intended to be used independently (it should be inherited as a mixin).
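The mappings attribute is described as a dict of dicts, one inner mapping per column. Applying it can be sketched with plain dictionaries as below. `apply_mappings` is a hypothetical helper, and leaving values without a mapping entry unchanged is an assumption rather than documented behaviour:

```python
def apply_mappings(data, mappings):
    """Map each listed column through its own {old_value: new_value} dict."""
    result = dict(data)  # shallow copy; unmapped columns are reused as-is
    for column, mapping in mappings.items():
        # Assumption: values with no entry in the mapping pass through unchanged.
        result[column] = [mapping.get(value, value) for value in data[column]]
    return result


test_data = {"a": ["Y", "N", "?"], "b": [3, 4, 5]}
mapped = apply_mappings(test_data, {"a": {"Y": 1, "N": 0}})
print(mapped)
```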
- class tubular.mapping.BaseMappingTransformer(mappings: dict[str, dict[Any, Any]], return_dtypes: dict[str, RETURN_DTYPES] | None = None, **kwargs: bool | None)[source]
Bases:
BaseTransformer
Base Transformer Extension for mapping transformers.
- mappings
Dictionary of mappings for each column individually. The dict passed to mappings in init is set to the mappings attribute.
- Type:
dict
- mappings_from_null
dict storing what null values will be mapped to. Generally best to use an imputer, but this functionality is useful for inverting pipelines.
- Type:
dict[str, Any]
- return_dtypes
Dictionary of col:dtype for returned columns
- Type:
dict[str, RETURN_DTYPES]
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> BaseMappingTransformer(
...     mappings={"a": {"Y": 1, "N": 0}},
...     return_dtypes={"a": "Int8"},
... )
BaseMappingTransformer(mappings={'a': {'N': 0, 'Y': 1}},
                       return_dtypes={'a': 'Int8'})
```
- FITS = False
- RETURN_DTYPES
alias of
Literal['String', 'Object', 'Categorical', 'Boolean', 'Int8', 'Int16', 'Int32', 'Int64', 'Float32', 'Float64']
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> mapping_transformer = BaseMappingTransformer(mappings={"a": {"x": 1}})

>>> mapping_transformer.to_json()
{'tubular_version': ..., 'classname': 'BaseMappingTransformer', 'init': {'copy': False, 'verbose': False, 'return_native': True, 'mappings': {'a': {'x': 1}}, 'return_dtypes': {'a': 'Int64'}}, 'fit': {'is_fitted_': True}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, return_native_override: bool | None = None) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Check mappings dict has been fitted.
- Parameters:
X (DataFrame) – Data to apply mappings to.
return_native_override (Optional[bool]) – option to override return_native attr in transformer, useful when calling parent methods
- Returns:
X – Input X, copied if specified by user.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> transformer = BaseMappingTransformer(
...     mappings={"a": {"Y": 1, "N": 0}},
...     return_dtypes={"a": "Int8"},
... )
>>> test_df = pl.DataFrame({"a": ["Y", "N"], "b": [3, 4]})
>>> # base class transform has no effect on data
>>> transformer.transform(test_df)
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ Y   ┆ 3   │
│ N   ┆ 4   │
└─────┴─────┘
```
- class tubular.mapping.CrossColumnAddTransformer(**kwargs)[source]
Bases:
BaseCrossColumnNumericTransformer
Transformer to apply an additive adjustment to values in one column based on the values of another column.
- adjust_column
Column containing the values to be adjusted.
- Type:
str
- mappings
Dictionary of additive adjustments for each column individually to be applied to the adjust_column. The dict passed to mappings in init is set to the mappings attribute.
- Type:
dict
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = False
- deprecated = True
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
- class tubular.mapping.CrossColumnMappingTransformer(**kwargs)[source]
Bases:
BaseCrossColumnMappingTransformer
Transformer to adjust values in one column based on the values of another column.
- adjust_column
Column containing the values to be adjusted.
- Type:
str
- mappings
Dictionary of mappings for each column individually to be applied to the adjust_column. The dict passed to mappings in init is set to the mappings attribute.
- Type:
dict
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = False
- deprecated = True
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
- class tubular.mapping.CrossColumnMultiplyTransformer(**kwargs)[source]
Bases:
BaseCrossColumnNumericTransformer
Transformer to apply a multiplicative adjustment to values in one column based on the values of another column.
- adjust_column
Column containing the values to be adjusted.
- Type:
str
- mappings
Dictionary of multiplicative adjustments for each column individually to be applied to the adjust_column. The dict passed to mappings in init is set to the mappings attribute.
- Type:
dict
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = False
- deprecated = True
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
- class tubular.mapping.MappingTransformer(mappings: dict[str, dict[Any, Any]], return_dtypes: dict[str, RETURN_DTYPES] | None = None, **kwargs: bool | None)[source]
Bases:
BaseMappingTransformer, BaseMappingTransformMixin
Transformer to map values in columns to other values e.g. to merge two levels into one.
Note, the MappingTransformer does not require ‘self-mappings’ to be defined i.e. if you want to map a value to itself, you can omit this value from the mappings rather than having to map it to itself.
This transformer inherits from BaseMappingTransformMixin as well as BaseMappingTransformer. BaseMappingTransformer performs standard checks, while BaseMappingTransformMixin handles the actual mapping logic.
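The "no self-mappings required" behaviour can be pictured with plain Python, independently of tubular. The helper `apply_mapping` below is hypothetical and only mimics the documented outcome (values missing from the mapping pass through unchanged); it says nothing about how the library implements it.

```python
from typing import Any


def apply_mapping(values: list[Any], mapping: dict[Any, Any]) -> list[Any]:
    """Map each value through `mapping`, leaving unmapped values unchanged."""
    # dict.get(v, v) falls back to the value itself, which is why explicit
    # self-mappings such as {"c": "c"} are unnecessary.
    return [mapping.get(v, v) for v in values]


# Hypothetical column in which levels "a" and "b" are merged into one level.
column = ["a", "b", "c", "a"]
merged = apply_mapping(column, {"a": "ab", "b": "ab"})
print(merged)  # ['ab', 'ab', 'c', 'ab']
```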
- Parameters:
mappings (dict) – Dictionary containing column mappings. Each key is a column to apply a mapping to, and each value is the mapping dict for that column. For example, the dict {'a': {1: 2, 3: 4}, 'b': {'a': 1, 'b': 2}} specifies a mapping for column a of 1->2, 3->4 and a mapping for column b of 'a'->1, 'b'->2.
return_dtypes (Optional[Dict[str, RETURN_DTYPES]]) – Dictionary of col:dtype for returned columns
**kwargs – Arbitrary keyword arguments passed on to the BaseMappingTransformer init method.
- mappings
Dictionary of mappings for each column individually. The dict passed to mappings in init is set to the mappings attribute.
- Type:
dict
- mappings_from_null
dict storing what null values will be mapped to. Generally best to use an imputer, but this functionality is useful for inverting pipelines.
- Type:
dict[str, Any]
- return_dtypes
Dictionary of col:dtype for returned columns
- Type:
dict[str, RETURN_DTYPES]
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> transformer = MappingTransformer(
...     mappings={"a": {"Y": 1, "N": 0}},
...     return_dtypes={"a": "Int8"},
... )
>>> transformer
MappingTransformer(mappings={'a': {'N': 0, 'Y': 1}},
                   return_dtypes={'a': 'Int8'})
>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = transformer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'MappingTransformer', 'init': {'copy': False, 'verbose': False, 'return_native': True, 'mappings': {'a': {'Y': 1, 'N': 0}}, 'return_dtypes': {'a': 'Int8'}}, 'fit': {'is_fitted_': True}}
>>> MappingTransformer.from_json(json_dump)
MappingTransformer(mappings={'a': {'N': 0, 'Y': 1}},
                   return_dtypes={'a': 'Int8'})
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform the input data X according to the mappings in the mappings attribute dict.
This method calls the BaseMappingTransformMixin.transform. Note, this transform method is different to some of the transform methods in the nominal module, even though they also use the BaseMappingTransformMixin.transform method. Here, if a value does not exist in the mapping it is unchanged.
- Parameters:
X (DataFrame) – Data with nominal columns to transform.
- Returns:
X – Transformed input X with levels mapped according to mappings dict.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> transformer = MappingTransformer(
...     mappings={'a': {'Y': 1, 'N': 0}},
...     return_dtypes={"a": "Int8"},
... )
>>> test_df = pl.DataFrame({'a': ["Y", "N"], 'b': [3, 4]})
>>> transformer.transform(test_df)
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i8  ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 0   ┆ 4   │
└─────┴─────┘
```
tubular.misc module
Contains legacy transformers for introducing fixed columns and changing dtypes.
- class tubular.misc.ColumnDtypeSetter(columns: str | ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]], dtype: ]], **kwargs: bool)[source]
Bases:
BaseTransformer
Transformer to set the specified columns in a dataframe to a given dtype.
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = False
- deprecated = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> from pprint import pprint
>>> transformer = ColumnDtypeSetter(columns="a", dtype="Float32")
>>> pprint(transformer.to_json(), sort_dicts=True)
{'classname': 'ColumnDtypeSetter',
 'fit': {'is_fitted_': True},
 'init': {'columns': ['a'],
          'copy': False,
          'dtype': 'Float32',
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform data.
- Parameters:
X (DataFrame) – data to transform.
- Returns:
transformed data
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> df = pl.DataFrame({"a": [1, 2]})
>>> transformer = ColumnDtypeSetter(columns="a", dtype="Float32")
>>> transformer.transform(df)
shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ f32 │
╞═════╡
│ 1.0 │
│ 2.0 │
└─────┘
```
- class tubular.misc.RenameColumnsTransformer(columns: ]] | str, new_column_names: dict[str, str], drop_original: bool = True, **kwargs: bool)[source]
Bases:
BaseTransformer, DropOriginalMixin
Transformer to rename a given set of columns.
This can be useful for personalising the auto-output names from other transformers, or for creating a few different versions of a given column to undergo separate paths of logic in a pipeline (as the expression logic effectively creates duplicates of the column).
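As a rough picture of the rename-by-copy idea, the sketch below works on a plain dict of column lists rather than a dataframe. `rename_columns` is a hypothetical helper, and keeping the original column when `drop_original=False` is an assumption inferred from the description above, not confirmed library behaviour.

```python
def rename_columns(data, new_column_names, drop_original=True):
    """Return a copy of `data` with renamed copies of the selected columns."""
    clashes = [new for new in new_column_names.values() if new in data]
    if clashes:
        raise ValueError(f"new column names already present: {clashes}")
    result = dict(data)
    for old, new in new_column_names.items():
        result[new] = data[old]  # the new name is a copy of the old column
        if drop_original:
            del result[old]
    return result


table = {"a": [1, 2, 3], "b": [4, 5, 6]}
print(rename_columns(table, {"a": "new_a"}))
# {'b': [4, 5, 6], 'new_a': [1, 2, 3]}
print(list(rename_columns(table, {"a": "new_a"}, drop_original=False)))
# ['a', 'b', 'new_a']
```

With `drop_original=False` the same input column can feed several differently named copies, each of which can then follow its own path in a pipeline.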
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> from pprint import pprint
>>> transformer = RenameColumnsTransformer(
...     columns="a", new_column_names={"a": "new_a"}
... )  # noqa: E501
>>> transformer
RenameColumnsTransformer(columns=['a'], new_column_names={'a': 'new_a'})
>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = transformer.to_json()
>>> pprint(json_dump, sort_dicts=True)
{'classname': 'RenameColumnsTransformer',
 'fit': {'is_fitted_': True},
 'init': {'columns': ['a'],
          'copy': False,
          'drop_original': True,
          'new_column_names': {'a': 'new_a'},
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}
>>> RenameColumnsTransformer.from_json(json_dump)
RenameColumnsTransformer(columns=['a'], new_column_names={'a': 'new_a'})
```
- FITS = False
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> transformer = RenameColumnsTransformer(
...     columns=["a", "b"],
...     new_column_names={"a": "new_a", "b": "new_b"},
... )
>>> transformer.get_feature_names_out()
['new_a', 'new_b']
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> from pprint import pprint
>>> transformer = RenameColumnsTransformer(
...     columns="a", new_column_names={"a": "new_a"}
... )  # noqa: E501
>>> pprint(transformer.to_json(), sort_dicts=True)
{'classname': 'RenameColumnsTransformer',
 'fit': {'is_fitted_': True},
 'init': {'columns': ['a'],
          'copy': False,
          'drop_original': True,
          'new_column_names': {'a': 'new_a'},
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Create column copies.
- Parameters:
X (DataFrame) – Data containing the columns to rename.
- Returns:
X – Transformed input X with columns renamed.
- Return type:
DataFrame
- Raises:
ValueError – if new_column_names values are already present in X
Examples
```pycon
>>> import polars as pl
>>> transformer = RenameColumnsTransformer(
...     columns="a", new_column_names={"a": "new_a"}
... )  # noqa: E501
>>> test_df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> transformer.transform(test_df)
shape: (3, 2)
┌─────┬───────┐
│ b   ┆ new_a │
│ --- ┆ ---   │
│ i64 ┆ i64   │
╞═════╪═══════╡
│ 4   ┆ 1     │
│ 5   ┆ 2     │
│ 6   ┆ 3     │
└─────┴───────┘
```
- class tubular.misc.SetValueTransformer(columns: ]] | str, value: int | float | str | bool | None, **kwargs: bool)[source]
Bases:
BaseTransformer
Transformer to set value of column(s) to a given value.
This should be used if columns need to be set to a constant value.
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> SetValueTransformer(columns="a", value=1)
SetValueTransformer(columns=['a'], value=1)
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> transformer = SetValueTransformer(columns="a", value=1)
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'SetValueTransformer', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'value': 1}, 'fit': {'is_fitted_': True}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Set columns to value.
- Parameters:
X (DataFrame) – Data containing the columns to set.
- Returns:
X – Transformed input X with columns set to value.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> transformer = SetValueTransformer(columns="a", value=1)
>>> test_df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> transformer.transform(test_df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i32 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 1   ┆ 5   │
│ 1   ┆ 6   │
└─────┴─────┘
```
- class tubular.misc.SimpleCastDtypes(*values)[source]
Bases:
str, Enum
Allowed dtypes for ColumnDtypeSetter.
- BOOLEAN = 'Boolean'
- CATEGORICAL = 'Categorical'
- FLOAT32 = 'Float32'
- FLOAT64 = 'Float64'
- INT16 = 'Int16'
- INT32 = 'Int32'
- INT64 = 'Int64'
- INT8 = 'Int8'
- STRING = 'String'
- UINT16 = 'UInt16'
- UINT32 = 'UInt32'
- UINT64 = 'UInt64'
- UINT8 = 'UInt8'
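Because the class derives from both str and Enum, its members behave as plain strings while still restricting the allowed names. The sketch below uses a hypothetical `CastDtype` enum with only three members to show this standard-library pattern; whether ColumnDtypeSetter validates its dtype argument in exactly this way is an assumption.

```python
from enum import Enum


class CastDtype(str, Enum):
    """Hypothetical stand-in for a str-based enum of allowed dtype names."""

    FLOAT32 = "Float32"
    INT8 = "Int8"
    STRING = "String"


def validate_dtype(name: str) -> CastDtype:
    """Return the enum member for `name`, raising ValueError if not allowed."""
    return CastDtype(name)  # Enum lookup by value


member = validate_dtype("Float32")
print(member is CastDtype.FLOAT32)  # True
print(member == "Float32")  # True, because members are also str instances
```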
tubular.mixins module
Contains mixin classes for use across transformers.
- class tubular.mixins.CheckNumericMixin[source]
Bases:
object
Mixin class with methods for numeric transformers.
- check_numeric_columns(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, return_native: bool = True) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Check column args are numeric for numeric transformers.
- Parameters:
X (DataFrame) – Data containing columns to check.
return_native (bool) – indicates whether to return nw or pd/pl dataframe
- Returns:
validated dataframe
- Return type:
DataFrame
- Raises:
TypeError – if provided columns are non-numeric
- class tubular.mixins.DropOriginalMixin[source]
Bases:
object
Mixin class to validate and apply ‘drop_original’ argument used by various transformers.
The transformer's input columns are dropped from the output when the boolean drop_original argument is True.
- classname() str[source]
Get name of the current class when called.
- Returns:
name of class
- Return type:
str
- static drop_original_column(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, drop_original: bool, columns: list[str] | str | None, return_native: bool = True) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Drop input columns from X if drop_original set to True.
- Parameters:
X (DataFrame) – Data with columns to drop.
drop_original (bool) – boolean dictating dropping the input columns from X after checks.
columns (list[str] | str | None) – Object containing columns to drop
return_native (bool) – controls whether mixin returns native or narwhals type
- Returns:
X – Transformed input X with columns dropped.
- Return type:
DataFrame
- class tubular.mixins.WeightColumnMixin[source]
Bases:
object
Mixin class with weights functionality.
- check_weights_column(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, weights_column: str) None[source]
Validate weights column in dataframe.
- Parameters:
X (DataFrame) – input data
weights_column (str) – name of weight column
- Raises:
ValueError – if weights_column is missing from data
ValueError – if weights_column is non-numeric
- classname() str[source]
Get the name of the current class when called.
- Returns:
name of class
- Return type:
str
- static get_valid_weights_filter_expr(weights_column: str, verbose: bool = False) Expr[source]
Build an expression for filtering down to rows with valid weights.
- Parameters:
weights_column (str) – name of weight column
verbose (bool) – control verbosity of method
- Returns:
expression to be used for filtering down to valid weights rows
- Return type:
nw.Expr
tubular.nominal module
Contains transformers that apply encodings to nominal columns.
- class tubular.nominal.GroupRareLevelsTransformer(columns: str | ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]] | None = None, cut_off_percent: ]] = 0.01, weights_column: str | None = None, rare_level_name: str | ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]] = 'rare', record_rare_levels: bool = True, unseen_levels_to_rare: bool = True, **kwargs: bool)[source]
Bases:
BaseTransformer, WeightColumnMixin
Group together rare levels of nominal variables into a new rare level.
Rare levels are defined by a cut off percentage, which can either be based on the number of rows or sum of weights. Any levels below this cut off value will be grouped into the rare level.
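The cut-off rule can be sketched in plain Python. The helpers `find_non_rare_levels` and `group_rare` are hypothetical names, and treating a level that sits exactly at the cut off as non-rare is an assumption, since the text above only says that levels below the cut off are grouped.

```python
from collections import Counter


def find_non_rare_levels(values, cut_off_percent, weights=None):
    """Return the sorted levels whose share of rows (or weight) reaches the cut off."""
    if weights is None:
        weights = [1.0] * len(values)  # unweighted: every row counts once
    totals = Counter()
    for value, weight in zip(values, weights):
        totals[value] += weight
    grand_total = sum(totals.values())
    # Assumption: a level exactly at the cut off is kept as non-rare.
    return sorted(
        level
        for level, total in totals.items()
        if total / grand_total >= cut_off_percent
    )


def group_rare(values, non_rare_levels, rare_level_name="rare"):
    """Replace every level not recorded as non-rare with the rare label."""
    keep = set(non_rare_levels)
    return [v if v in keep else rare_level_name for v in values]


column = ["x", "x", "y"]
non_rare = find_non_rare_levels(column, cut_off_percent=0.5)
print(non_rare)  # ['x']
print(group_rare(column, non_rare, "rare_level"))  # ['x', 'x', 'rare_level']
```

Because `group_rare` relabels anything outside the recorded non-rare levels, previously unseen levels also end up in the rare group, which corresponds to the unseen_levels_to_rare=True setting.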
- cut_off_percent
Cut off percentage (either in terms of number of rows or sum of weight) for a given nominal level to be considered rare.
- Type:
float
- non_rare_levels
Created in fit. A dict of non-rare levels (i.e. levels with more than cut_off_percent weight or rows) that is used to identify rare levels in transform.
- Type:
dict
- rare_level_name
Must be of the same type as columns. Label for the new nominal level that will be added to group together rare levels (as defined by cut_off_percent).
- Type:
any
- record_rare_levels
Should the ‘rare’ levels that will be grouped together be recorded? If not they will be lost after the fit and the only information remaining will be the ‘non-rare’ levels.
- Type:
bool
- rare_levels_record
Only created (in fit) if record_rare_levels is True. This is a dict containing a list of levels that were grouped into ‘rare’ for each column the transformer was applied to.
- Type:
dict
- weights_column
Name of the weights column to use if cut_off_percent should be in terms of sum of weights rather than number of rows.
- Type:
str
- unseen_levels_to_rare
If True, unseen levels in new data will be mapped to the rare level; if False, they will be left unchanged.
- Type:
bool
- training_data_levels
Dictionary containing the set of values present in the training data for each column in self.columns. It will only exist if unseen_levels_to_rare is set to False.
- Type:
dict[set]
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> GroupRareLevelsTransformer(
...     columns="a",
...     cut_off_percent=0.02,
...     rare_level_name="rare_level",
... )
GroupRareLevelsTransformer(columns=['a'], cut_off_percent=0.02,
                           rare_level_name='rare_level')
```
- FITS = True
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame | None = None) GroupRareLevelsTransformer[source]
Record non-rare levels for categorical variables.
When transform is called, only levels recorded in non_rare_levels during fit will remain unchanged; all other levels will be grouped. If record_rare_levels is True then the rare levels will also be recorded.
The label for the rare levels must be of the same type as the columns.
- Parameters:
X (DataFrame) – Data to identify non-rare levels from.
y (Series or LazyFrame or None, default = None) – Optional argument only required for the transformer to work with sklearn pipelines.
- Returns:
fitted class instance
- Return type:
GroupRareLevelsTransformer
Examples
```pycon
>>> import polars as pl
>>> transformer = GroupRareLevelsTransformer(
...     columns="a",
...     cut_off_percent=0.02,
...     rare_level_name="rare_level",
... )
>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": ["w", "z"]})
>>> transformer.fit(test_df)
GroupRareLevelsTransformer(columns=['a'], cut_off_percent=0.02,
                           rare_level_name='rare_level')
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> import tests.test_data as d
>>> df = d.create_df_8("pandas")
>>> x = GroupRareLevelsTransformer(
...     columns=["b", "c"], cut_off_percent=0.4, unseen_levels_to_rare=False
... )
>>> x.fit(df)
GroupRareLevelsTransformer(columns=['b', 'c'], cut_off_percent=0.4,
                           unseen_levels_to_rare=False)
>>> x.to_json()
{'tubular_version': ..., 'classname': 'GroupRareLevelsTransformer', 'init': {'columns': ['b', 'c'], 'copy': False, 'verbose': False, 'return_native': True, 'cut_off_percent': 0.4, 'weights_column': None, 'rare_level_name': 'rare', 'record_rare_levels': True, 'unseen_levels_to_rare': False}, 'fit': {'is_fitted_': True, 'non_rare_levels': {'b': ['w'], 'c': ['a']}, 'training_data_levels': {'b': ['w', 'x', 'y', 'z'], 'c': ['a', 'b', 'c']}, 'rare_levels_record': {'b': ['x', 'y', 'z'], 'c': ['b', 'c']}}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Group rare levels together into a new ‘rare’ level.
- Parameters:
X (DataFrame) – Data with categorical variables to apply rare level grouping to.
- Returns:
X – Transformed input X with rare levels grouped into a new rare level.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> transformer = GroupRareLevelsTransformer(
...     columns="a",
...     cut_off_percent=0.5,
...     rare_level_name="rare_level",
... )
>>> test_df = pl.DataFrame({"a": ["x", "x", "y"], "b": ["w", "z", "z"]})
>>> _ = transformer.fit(test_df)
>>> transformer.transform(test_df)
shape: (3, 2)
┌────────────┬─────┐
│ a          ┆ b   │
│ ---        ┆ --- │
│ str        ┆ str │
╞════════════╪═════╡
│ x          ┆ w   │
│ x          ┆ z   │
│ rare_level ┆ z   │
└────────────┴─────┘
```
- class tubular.nominal.MeanResponseTransformer(columns: str | list[str] | None = None, weights_column: str | None = None, prior: ]] = 0, level: float | int | str | list | None = None, unseen_level_handling: float | int | Literal['mean', 'median', 'min', 'max'] | None = None, return_type: Literal['Float32', 'Float64'] = 'Float32', drop_original: bool = True, **kwargs: bool)[source]
Bases:
BaseTransformer, WeightColumnMixin, DropOriginalMixin
Convert categorical variables to numeric by mapping levels to the mean response for level.
For a continuous or binary response the categorical columns specified will have values replaced with the mean response for each category.
For an n > 1 level categorical response, up to n binary responses can be created, which can then be used to encode each categorical column specified. This will generate up to n * len(columns) new columns, with names of the form {column}_{response_level}. The original columns will be removed from the dataframe. This functionality is controlled using the ‘level’ parameter. Note that the above only works for an n > 1 level categorical response. Do not use the ‘level’ parameter for an n = 1 level numerical response. In this case, use the standard mean response transformer without the ‘level’ parameter.
If a categorical variable contains null values these will not be transformed.
The same weights and prior are applied to each response level in the multi-level case.
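To make the role of the prior concrete, the sketch below computes a regularised mean response in plain Python. The shrinkage formula, (level sum + prior * global mean) / (level count + prior), is an assumption: it is a common form of this regularisation and reproduces the mapping values shown in the examples on this page, but it is not confirmed here as the library's implementation. `mean_response_mapping` is a hypothetical helper and ignores weights.

```python
from collections import defaultdict


def mean_response_mapping(levels, response, prior=0.0):
    """Map each categorical level to a prior-regularised mean response."""
    global_mean = sum(response) / len(response)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, response):
        sums[level] += y
        counts[level] += 1
    # Assumed formula: shrink each level mean towards the global mean,
    # with `prior` acting as a number of pseudo-observations.
    return {
        level: (sums[level] + prior * global_mean) / (counts[level] + prior)
        for level in sums
    }


mapping = mean_response_mapping(["x", "y"], [0, 1], prior=1)
print(mapping)  # {'x': 0.25, 'y': 0.75}
```

With prior=0 the same call returns the plain per-level means, {'x': 0.0, 'y': 1.0}, so larger priors pull small categories towards the overall mean.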
- columns
Categorical columns to encode in the input data.
- Type:
str or list
- weights_column
Weights column to use when calculating the mean response.
- Type:
str or None
- prior
Regularisation parameter, can be thought of roughly as the size a category should be in order for its statistics to be considered reliable (hence default value of 0 means no regularisation).
- Type:
int, default = 0
- level
Parameter to control encoding against a multi-level categorical response. If None the response will be treated as binary or continuous, if ‘all’ all response levels will be encoded against and if it is a list of levels then only the levels specified will be encoded against.
- Type:
str, int, float, list or None, default = None
- response_levels
Only created in the multi-level case. Generated from level, list of all the response levels to encode against.
- Type:
list
- mappings
Created in fit. A nested Dict of {column names : column specific mapping dictionary} pairs. Column specific mapping dictionaries contain {initial value : mapped value} pairs.
- Type:
dict
- mapped_columns
Only created in the multi-level case. A list of the new columns produced by encoding the columns in self.columns against multiple response levels, of the form {column}_{level}.
- Type:
list
- transformer_dict
Only created in the multi-level case. A dictionary of the form level : transformer containing the mean response transformers for each level to be encoded against.
- Type:
dict
- unseen_levels_encoding_dict
Dict containing the values (based on chosen unseen_level_handling) derived from the encoded columns to use when handling unseen levels in data passed to transform method.
- Type:
dict
- return_type
What type to cast return column as. Defaults to Float32.
- Type:
Literal[‘Float32’, ‘Float64’]
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> import polars as pl
>>> transformer = MeanResponseTransformer(
...     columns="a",
...     prior=1,
...     unseen_level_handling="mean",
... )
>>> transformer
MeanResponseTransformer(columns=['a'], prior=1, unseen_level_handling='mean')
>>> # once fit, transformer can also be dumped to json and reinitialised
>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": [0, 1]})
>>> _ = transformer.fit(test_df[["a"]], test_df["b"])
>>> json_dump = transformer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'MeanResponseTransformer', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'weights_column': None, 'prior': 1, 'level': None, 'unseen_level_handling': 'mean', 'return_type': 'Float32', 'drop_original': True}, 'fit': {'is_fitted_': True, 'mappings': {'a': {'x': 0.25, 'y': 0.75}}, 'return_dtypes': {'a': 'Float32'}, 'column_to_encoded_columns': {'a': ['a']}, 'encoded_columns': ['a'], 'unseen_levels_encoding_dict': {'a': 0.5}}}
>>> MeanResponseTransformer.from_json(json_dump)
MeanResponseTransformer(columns=['a'], prior=1, unseen_level_handling='mean')
```
- FITS = True
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame) MeanResponseTransformer[source]
Identify mapping of categorical levels to mean response values.
If the user specified the weights_column arg when initialising the transformer, the weighted mean response will be calculated using that column.
In the multi-level case this method learns which response levels are present and are to be encoded against.
- Parameters:
X (DataFrame) – Data with categorical variable columns to transform.
y (Series or LazyFrame) – Response variable or target.
- Returns:
fitted class instance
- Return type:
MeanResponseTransformer
- Raises:
ValueError – if y contains null values
Examples
```pycon
>>> import polars as pl
>>> transformer = MeanResponseTransformer(
...     columns="a",
...     prior=1,
...     unseen_level_handling="mean",
... )
>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": [1, 2], "target": [0, 1]})
>>> transformer.fit(test_df, test_df["target"])
MeanResponseTransformer(columns=['a'], prior=1, unseen_level_handling='mean')
```
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> import polars as pl
>>> transformer = MeanResponseTransformer(
...     columns="a",
...     prior=1,
...     unseen_level_handling="mean",
... )
>>> transformer.get_feature_names_out()
['a']
>>> transformer = MeanResponseTransformer(
...     columns="a",
...     prior=1,
...     level=["x", "y"],
...     unseen_level_handling="mean",
... )
>>> transformer.get_feature_names_out()
['a_x', 'a_y']
>>> transformer = MeanResponseTransformer(
...     columns="a",
...     prior=1,
...     level="all",
...     unseen_level_handling="mean",
... )
>>> transformer.get_feature_names_out()
Traceback (most recent call last):
    ...
sklearn.exceptions.NotFittedError: ...
>>> test_df = pl.DataFrame({"a": ["x", "y", "x"], "b": ["cat", "dog", "rat"]})
>>> _ = transformer.fit(test_df, test_df["b"])
>>> transformer.get_feature_names_out()
['a_cat', 'a_dog', 'a_rat']
```
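The multi-level case can be pictured with a small pure-Python sketch. It is not the tubular implementation: the helper names are hypothetical, the sorted level order is an assumption that happens to match the output above, and the prior is ignored. Each response level gets its own encoded column, filled with the mean of the indicator "response equals this level" within each category.

```python
def multi_level_feature_names(column, response):
    """Hypothetical helper: one output name per distinct response level."""
    levels = sorted(set(response))  # assumption: levels are listed in sorted order
    return [f"{column}_{level}" for level in levels]


def level_mean_response(categories, response, level):
    """Mean of the indicator (response == level) within each category."""
    hits, counts = {}, {}
    for category, value in zip(categories, response):
        hits[category] = hits.get(category, 0) + (1 if value == level else 0)
        counts[category] = counts.get(category, 0) + 1
    return {category: hits[category] / counts[category] for category in hits}


a = ["x", "y", "x"]
b = ["cat", "dog", "rat"]

names = multi_level_feature_names("a", b)
cat_encoding = level_mean_response(a, b, "cat")
```

Here `names` reproduces the three feature names from the example, and `cat_encoding` is the mapping that would feed the `a_cat` column under these assumptions.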
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> import polars as pl
>>> transformer = MeanResponseTransformer(columns=["a"])
>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": [0, 1]})
>>> _ = transformer.fit(test_df[["a"]], test_df["b"])
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'MeanResponseTransformer', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'weights_column': None, 'prior': 0, 'level': None, 'unseen_level_handling': None, 'return_type': 'Float32', 'drop_original': True}, 'fit': {'is_fitted_': True, 'mappings': {'a': {'x': 0.0, 'y': 1.0}}, 'return_dtypes': {'a': 'Float32'}, 'column_to_encoded_columns': {'a': ['a']}, 'encoded_columns': ['a']}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Apply mean response encoding stored in the mappings attribute to columns.
- Parameters:
X (DataFrame) – Data with nominal columns to transform.
- Returns:
X – Transformed input X with levels mapped according to mappings dict.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> # example with no prior
>>> transformer = MeanResponseTransformer(
...     columns="a",
...     prior=0,
...     unseen_level_handling="mean",
... )
>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": [1, 2], "target": [0, 1]})
>>> _ = transformer.fit(test_df, test_df["target"])
>>> transformer.transform(test_df)
shape: (2, 3)
┌─────┬─────┬────────┐
│ a   ┆ b   ┆ target │
│ --- ┆ --- ┆ ---    │
│ f32 ┆ i64 ┆ i64    │
╞═════╪═════╪════════╡
│ 0.0 ┆ 1   ┆ 0      │
│ 1.0 ┆ 2   ┆ 1      │
└─────┴─────┴────────┘
>>> # example with prior
>>> transformer = MeanResponseTransformer(
...     columns="a",
...     prior=1,
...     unseen_level_handling="mean",
... )
>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": [1, 2], "target": [0, 1]})
>>> _ = transformer.fit(test_df, test_df["target"])
>>> transformer.transform(test_df)
shape: (2, 3)
┌──────┬─────┬────────┐
│ a    ┆ b   ┆ target │
│ ---  ┆ --- ┆ ---    │
│ f32  ┆ i64 ┆ i64    │
╞══════╪═════╪════════╡
│ 0.25 ┆ 1   ┆ 0      │
│ 0.75 ┆ 2   ┆ 1      │
└──────┴─────┴────────┘
```
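The values in the two transform examples are consistent with shrinking each level's mean towards the global mean, with the prior acting like a count of pseudo-observations. The sketch below is a plain-Python illustration of that arithmetic, not the library's code; the formula and the helper name `mean_response_mappings` are assumptions chosen because they reproduce the outputs shown.

```python
def mean_response_mappings(categories, response, prior=0):
    """Per-level mean response, pulled towards the global mean by `prior`.

    Assumed formula: (level_sum + prior * global_mean) / (level_count + prior).
    """
    global_mean = sum(response) / len(response)
    sums, counts = {}, {}
    for level, value in zip(categories, response):
        sums[level] = sums.get(level, 0.0) + value
        counts[level] = counts.get(level, 0) + 1
    return {
        level: (sums[level] + prior * global_mean) / (counts[level] + prior)
        for level in sums
    }


a = ["x", "y"]
target = [0, 1]

no_prior = mean_response_mappings(a, target, prior=0)
with_prior = mean_response_mappings(a, target, prior=1)
```

With `prior=0` the mappings are the raw level means (0.0 and 1.0). With `prior=1` each level is averaged with one pseudo-observation at the global mean of 0.5, giving 0.25 and 0.75.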
- class tubular.nominal.NominalToIntegerTransformer(**kwargs)[source]
Bases:
BaseMappingTransformMixin
Transformer to convert columns containing nominal values into integer values.
The nominal levels that are mapped to integers are not ordered in any way.
- start_encoding
Value to start the encoding / mapping of nominal to integer from.
- Type:
int
- mappings
Created in fit. A dict of key (column names) value (mappings between levels and integers for given column) pairs.
- Type:
dict
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = True
- deprecated = True
- fit(X: pd.DataFrame, y: pd.Series | None = None) pd.DataFrame[source]
Create mapping between nominal levels and integer values for categorical variables.
- Parameters:
X (pd.DataFrame) – Data to fit the transformer on, this sets the nominal levels that can be mapped.
y (None or pd.DataFrame or pd.Series, default = None) – Optional argument only required for the transformer to work with sklearn pipelines.
- Returns:
fitted class instance
- Return type:
NominalToIntegerTransformer
- Raises:
ValueError – if column has more levels than can be encoded as int8
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
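As a rough illustration of the mapping built in fit, the sketch below numbers the distinct levels of a column starting from start_encoding. It is a hypothetical stand-in rather than the library's code: the order of first appearance is an arbitrary choice (the levels are described as unordered), and the int8 check assumes the limit is the signed 8-bit maximum of 127.

```python
INT8_MAX = 127  # assumption: largest code that fits in a signed 8-bit integer


def nominal_to_integer_mapping(values, start_encoding=0):
    """Map each distinct level to an integer, counting up from start_encoding."""
    levels = list(dict.fromkeys(values))  # distinct levels, first-seen order
    highest_code = start_encoding + len(levels) - 1
    if highest_code > INT8_MAX:
        raise ValueError("column has more levels than can be encoded as int8")
    return {level: start_encoding + i for i, level in enumerate(levels)}


mapping = nominal_to_integer_mapping(["b", "a", "b", "c"], start_encoding=1)
encoded = [mapping[value] for value in ["b", "a", "b", "c"]]
```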
- class tubular.nominal.OneHotEncodingTransformer(columns: str | ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]] | None = None, wanted_values: dict[str, ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]]] | None = None, separator: str = '_', drop_original: bool = False, **kwargs: bool)[source]
Bases:
DropOriginalMixin, BaseTransformer
Transformer to convert categorical variables into dummy columns.
- separator
Separator used in naming for dummy columns.
- Type:
str
- drop_original
Should original columns be dropped after creating dummy fields?
- Type:
bool
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> import polars as pl
>>> transformer = OneHotEncodingTransformer(
...     columns="a",
... )
>>> transformer
OneHotEncodingTransformer(columns=['a'])
>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": ["w", "z"]})
>>> _ = transformer.fit(test_df)
>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = transformer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'OneHotEncodingTransformer', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'wanted_values': None, 'separator': '_', 'drop_original': False}, 'fit': {'is_fitted_': True, 'categories_': {'a': ['x', 'y']}, 'new_feature_names_': {'a': ['a_x', 'a_y']}}}
>>> OneHotEncodingTransformer.from_json(json_dump)
OneHotEncodingTransformer(columns=['a'])
```
- FITS = True
- MAX_LEVELS = 100
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame | None = None) OneHotEncodingTransformer[source]
Get list of levels for each column to be transformed.
This defines which dummy columns will be created in transform.
- Parameters:
X (DataFrame) – Data to identify levels from.
y (None) – Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns:
fitted class instance
- Return type:
OneHotEncodingTransformer
- Raises:
ValueError – if column has >100 levels
Examples
```pycon
>>> import polars as pl
>>> transformer = OneHotEncodingTransformer(
...     columns="a",
... )
>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": [1, 2]})
>>> transformer.fit(test_df)
OneHotEncodingTransformer(columns=['a'])
```
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> import polars as pl
>>> transformer = OneHotEncodingTransformer(
...     columns="a",
...     wanted_values={"a": ["cat", "dog"]},
... )
>>> transformer.get_feature_names_out()
['a_cat', 'a_dog']
>>> transformer = OneHotEncodingTransformer(
...     columns="a",
... )
>>> transformer.get_feature_names_out()
Traceback (most recent call last):
    ...
sklearn.exceptions.NotFittedError: ...
>>> test_df = pl.DataFrame({"a": ["cat", "dog", "rat"]})
>>> _ = transformer.fit(test_df)
>>> transformer.get_feature_names_out()
['a_cat', 'a_dog', 'a_rat']
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> import polars as pl
>>> transformer = OneHotEncodingTransformer(columns=["a"])
>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": ["w", "z"]})
>>> _ = transformer.fit(test_df)
>>> # version will vary for local vs CI, so use ... as generic match
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'OneHotEncodingTransformer', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'wanted_values': None, 'separator': '_', 'drop_original': False}, 'fit': {'is_fitted_': True, 'categories_': {'a': ['x', 'y']}, 'new_feature_names_': {'a': ['a_x', 'a_y']}}}
```
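Since the dump shown above contains only strings, booleans, None, lists and dicts, it should survive a round trip through the standard json module. The sketch below checks that on a hand-copied dict; it does not call tubular, and the "<version>" string is a placeholder for the elided version value, not a real one.

```python
import json

# Hand-copied from the example output above; the version is a placeholder.
dump = {
    "tubular_version": "<version>",
    "classname": "OneHotEncodingTransformer",
    "init": {
        "columns": ["a"],
        "copy": False,
        "verbose": False,
        "return_native": True,
        "wanted_values": None,
        "separator": "_",
        "drop_original": False,
    },
    "fit": {
        "is_fitted_": True,
        "categories_": {"a": ["x", "y"]},
        "new_feature_names_": {"a": ["a_x", "a_y"]},
    },
}

text = json.dumps(dump)      # serialise to a JSON string
restored = json.loads(text)  # parse it back into a nested dict
```

The restored dict has the same shape that from_json is shown accepting in the class-level example, so storing the string on disk between sessions seems a reasonable use.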
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, return_native_override: bool | None = None) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Create new dummy columns from categorical fields.
- Parameters:
X (DataFrame) – Data to apply one hot encoding to.
return_native_override (Optional[bool]) – Option to override return_native attr in transformer, useful when calling parent methods. Controls whether transformer returns narwhals or native type.
- Returns:
X_transformed – Transformed input X with dummy columns derived from categorical columns added. If drop_original = True then the original categorical columns that the dummies are created from will not be in the output X.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> transformer = OneHotEncodingTransformer(
...     columns="a",
... )
>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": [1, 2]})
>>> _ = transformer.fit(test_df)
>>> transformer.transform(test_df)
shape: (2, 4)
┌─────┬─────┬───────┬───────┐
│ a   ┆ b   ┆ a_x   ┆ a_y   │
│ --- ┆ --- ┆ ---   ┆ ---   │
│ str ┆ i64 ┆ bool  ┆ bool  │
╞═════╪═════╪═══════╪═══════╡
│ x   ┆ 1   ┆ true  ┆ false │
│ y   ┆ 2   ┆ false ┆ true  │
└─────┴─────┴───────┴───────┘
```
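The shape of this output can be mimicked without any dataframe library. The following sketch is illustrative only: `one_hot_encode` is a hypothetical helper working on a dict of lists, and sorting the levels is an assumption. It builds one boolean column per level, named from the column, the separator and the level.

```python
def one_hot_encode(rows, column, levels, separator="_", drop_original=False):
    """Add one boolean dummy column per level to a dict-of-lists table."""
    out = dict(rows)
    for level in levels:
        out[f"{column}{separator}{level}"] = [
            value == level for value in rows[column]
        ]
    if drop_original:
        del out[column]  # mirrors the documented drop_original option
    return out


data = {"a": ["x", "y"], "b": [1, 2]}
levels = sorted(set(data["a"]))  # assumption: levels learned in sorted order
encoded = one_hot_encode(data, "a", levels)
```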
- class tubular.nominal.OrdinalEncoderTransformer(**kwargs)[source]
Bases:
BaseMappingTransformMixin, WeightColumnMixin
Encode categorical variables into ascending rank-ordered integer values.
Maps levels to the target-mean response for that level.
Values are sorted in ascending order only, i.e. the categorical level with the lowest target-mean response is encoded as 1, the level with the next lowest as 2, and so on.
If a categorical variable contains null values these will not be transformed.
- weights_column
Weights column to use when calculating the mean response.
- Type:
str or None
- mappings
Created in fit. Dict of key (column names) value (mapping of categorical levels to numeric, ordinal encoded response values) pairs.
- Type:
dict
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = True
- deprecated = True
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series) OrdinalEncoderTransformer[source]
Identify mapping of categorical levels to rank-ordered integer values by target-mean in ascending order.
If the user specified the weights_column arg when initialising the transformer, the weighted mean response will be calculated using that column.
- Parameters:
X (DataFrame) – Data with categorical variable columns to transform and response_column column specified when object was initialised.
y (Series or LazyFrame) – Response column or target.
- Returns:
fitted class instance
- Return type:
OrdinalEncoderTransformer
- Raises:
ValueError – if y contains nulls
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Apply ordinal encoding stored in the mappings attribute to columns.
This maps categorical levels to rank-ordered integer values by target-mean in ascending order.
- Parameters:
X (DataFrame) – Data with categorical variable columns to transform.
- Returns:
X – Transformed data with levels mapped to ordinal encoded values for categorical variables.
- Return type:
DataFrame
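The ranking described for OrdinalEncoderTransformer can be sketched in a few lines of plain Python. This is an unweighted illustration under stated assumptions, not the library's code: `ordinal_mappings` is a hypothetical helper, weights are ignored, and tie handling is left out because it is not specified here. Null values are passed through untransformed, as the class description says.

```python
def ordinal_mappings(categories, response):
    """Rank levels by mean response, ascending, with codes starting at 1."""
    sums, counts = {}, {}
    for level, value in zip(categories, response):
        if level is None:
            continue  # nulls take no part in the mapping
        sums[level] = sums.get(level, 0.0) + value
        counts[level] = counts.get(level, 0) + 1
    means = {level: sums[level] / counts[level] for level in sums}
    ranked = sorted(means, key=means.get)
    return {level: rank for rank, level in enumerate(ranked, start=1)}


a = ["x", "x", "y", "z", "z", None]
target = [1, 0, 0, 1, 1, 0]

mappings = ordinal_mappings(a, target)  # means: y=0.0, x=0.5, z=1.0
encoded = [None if value is None else mappings[value] for value in a]
```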
tubular.numeric module
Contains transformers that apply numeric functions.
- class tubular.numeric.BaseNumericTransformer(columns: list[str], **kwargs: dict[str, bool])[source]
Bases:
BaseTransformer, CheckNumericMixin
Extends BaseTransformer for numeric scenarios.
- columns
List of columns to be operated on
- Type:
List[str]
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> BaseNumericTransformer(
...     columns="a",
... )
BaseNumericTransformer(columns=['a'])
```
- FITS = False
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | None = None) BaseNumericTransformer[source]
Validate data and attributes prior to the child objects fit logic.
- Parameters:
X (DataFrame) – A dataframe containing the required columns
y (Series | None) – Required for pipeline.
- Returns:
fitted class instance.
- Return type:
BaseNumericTransformer
Examples
```pycon
>>> import polars as pl
>>> transformer = BaseNumericTransformer(
...     columns="a",
... )
>>> test_df = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
>>> transformer.fit(test_df)
BaseNumericTransformer(columns=['a'])
```
- jsonable = False
- lazyframe_compatible = True
- polars_compatible = True
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, return_native_override: bool | None = None) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Validate data and attributes prior to the child objects transform logic.
- Parameters:
X (DataFrame) – Data to transform.
return_native_override (Optional[bool]) – Option to override return_native attr in transformer, useful when calling parent methods
- Returns:
X – Validated data
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> transformer = BaseNumericTransformer(
...     columns="a",
... )
>>> test_df = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
>>> # base class has no effect on data
>>> transformer.transform(test_df)
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘
```
- class tubular.numeric.CutTransformer(**kwargs)[source]
Bases:
BaseNumericTransformer
Class to bin a column into discrete intervals.
Class simply uses the [pd.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) method on the specified column.
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = False
- deprecated = True
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
- class tubular.numeric.DifferenceTransformer(columns: ]], **kwargs: bool | None)[source]
Bases:
BaseNumericTransformer
Transformer that performs a subtraction operation between two columns.
This transformer allows performing subtraction between two columns in a DataFrame and stores the result in a new column.
- columns
List of exactly two column names to operate on. The second column is subtracted from the first.
- Type:
ListOfTwoStrs
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> transformer = DifferenceTransformer(columns=["a", "b"])
>>> transformer.columns
['a', 'b']
```
- FITS = False
- get_feature_names_out() list[str][source]
Get the names of the output features.
- Returns:
List containing the name of the new column created by the transformation.
- Return type:
list[str]
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform the DataFrame by applying the subtraction operation between two columns.
- Parameters:
X (DataFrame) – DataFrame containing the columns to operate on.
- Returns:
Transformed DataFrame with the new column containing the subtraction results.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> transformer = DifferenceTransformer(columns=["a", "b"])
>>> test_df = pl.DataFrame({"a": [100, 200, 300], "b": [80, 150, 200]})
>>> transformer.transform(test_df)
shape: (3, 3)
┌─────┬─────┬───────────┐
│ a   ┆ b   ┆ a_minus_b │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ i64       │
╞═════╪═════╪═══════════╡
│ 100 ┆ 80  ┆ 20        │
│ 200 ┆ 150 ┆ 50        │
│ 300 ┆ 200 ┆ 100       │
└─────┴─────┴───────────┘
```
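For readers without polars to hand, the same arithmetic and column naming can be reproduced on a dict of lists. The helper below is a hypothetical sketch, not part of tubular; it only mirrors the `<first>_minus_<second>` naming visible in the output above.

```python
def difference(rows, columns):
    """Subtract the second named column from the first, as a new column."""
    first, second = columns  # exactly two column names are expected
    new_name = f"{first}_minus_{second}"
    values = [x - y for x, y in zip(rows[first], rows[second])]
    return {**rows, new_name: values}


data = {"a": [100, 200, 300], "b": [80, 150, 200]}
result = difference(data, ["a", "b"])
```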
- class tubular.numeric.InteractionTransformer(**kwargs)[source]
Bases:
BaseNumericTransformer
Generates interaction features.
Transformer generates a new column for each combination of the selected columns, up to the maximum degree provided. (For sklearn versions higher than 1.0.0, only interactions of degree greater than or equal to the minimum degree are computed.) Each interaction column is the product of the columns in that combination. For example, with 3 columns provided ["a", "b", "c"] and a max degree of 3, the possible combinations are:
- of degree 1: ["a", "b", "c"]
- of degree 2: ["a b", "b c", "a c"]
- of degree 3: ["a b c"]
- min_degree
minimum degree of interaction features to be considered
- Type:
int
- max_degree
maximum degree of interaction features to be considered
- Type:
int
- nb_features_to_interact
number of selected columns from which interactions should be computed. (=len(columns))
- Type:
int
- nb_combinations
number of new interaction features
- Type:
int
- interaction_colname
names of each new interaction feature. The name of an interaction feature is the combination of the original column names joined with a whitespace. The interaction feature of ["col1", "col2", "col3"] would be "col1 col2 col3".
- Type:
list
- nb_feature_out
number of total columns of transformed dataset, including new interaction features
- Type:
int
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = False
- MIN_DEGREE_VALUE = 2
- deprecated = True
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
- transform(X: DataFrame) DataFrame[source]
Generate interaction features using the “product” pandas.DataFrame method.
- Parameters:
X (pd.DataFrame) – Data to transform.
- Returns:
X – Input X with additional column or columns (self.interaction_colname) added. These contain the output of running the product pandas DataFrame method on identified combinations.
- Return type:
pd.DataFrame
- Raises:
TypeError – for invalid PolynomialFeatures._combinations arguments
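The combination-and-product behaviour described for InteractionTransformer can be sketched with the standard library. This is an illustration only, using itertools rather than the sklearn machinery the class refers to; the helper name is hypothetical, and the order of the generated columns may differ from the library's.

```python
from itertools import combinations
from math import prod


def interaction_features(rows, columns, min_degree=2, max_degree=2):
    """Add one product column per combination of `columns` in the degree range."""
    out = dict(rows)
    for degree in range(min_degree, max_degree + 1):
        for combo in combinations(columns, degree):
            name = " ".join(combo)  # names are joined with a whitespace
            out[name] = [prod(values) for values in zip(*(rows[c] for c in combo))]
    return out


data = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
result = interaction_features(data, ["a", "b", "c"], min_degree=2, max_degree=3)
new_columns = [name for name in result if name not in data]
```

With three input columns and degrees 2 to 3 this yields four new columns, three of degree 2 and one of degree 3.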
- class tubular.numeric.LogTransformer(**kwargs)[source]
Bases:
BaseNumericTransformer, DropOriginalMixin
Transformer to apply log transformation.
Transformer has the option to add 1 to the columns to log and drop the original columns.
- add_1
Whether to add 1 to the columns before applying the log.
- Type:
bool
- drop_original
Whether to drop the original columns after the log columns have been created.
- Type:
bool
- suffix
The suffix to add onto the end of column names for new columns.
- Type:
str
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = False
- deprecated = True
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
- transform(X: DataFrame) DataFrame[source]
Apply the log transform to the specified columns.
If the drop_original attribute is True then the original columns are dropped. If the add_1 attribute is True then 1 is added to the original columns before the log is taken.
- Parameters:
X (pd.DataFrame) – The dataframe to be transformed.
- Returns:
X – The dataframe with the specified columns logged, optionally dropping the original columns if self.drop_original is True.
- Return type:
pd.DataFrame
- Raises:
ValueError – if provided columns contain negative values.
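A minimal pure-Python sketch of the documented options (add_1, drop_original, suffix) follows. It is not the library's implementation: the `<column>_<suffix>` naming, the default suffix of "log", and raising for any value that is not strictly positive after the optional shift are all assumptions made for illustration.

```python
import math


def log_transform(rows, columns, add_1=False, drop_original=False, suffix="log"):
    """Add natural-log columns to a dict-of-lists table."""
    out = dict(rows)
    for column in columns:
        shifted = [value + 1 if add_1 else value for value in rows[column]]
        if any(value <= 0 for value in shifted):
            # assumption: reject anything the logarithm is undefined for
            raise ValueError(f"values in {column} cannot be logged")
        out[f"{column}_{suffix}"] = [math.log(value) for value in shifted]
        if drop_original:
            del out[column]
    return out


data = {"a": [0.0, math.e - 1], "b": [1, 2]}
result = log_transform(data, ["a"], add_1=True, drop_original=True)
```

With add_1 set, a zero in the input becomes log(1) = 0 rather than an error, which is the usual reason for offering that option.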
- class tubular.numeric.OneDKmeansTransformer(columns: str | ~typing.Annotated[list[str], beartype.vale.Is[lambda list_arg: ...]], new_column_name: str, n_init: str | int = 'auto', n_clusters: int = 8, drop_original: bool = False, kmeans_kwargs: dict[str, object] | None = None, **kwargs: bool)[source]
Bases:
BaseNumericTransformer, DropOriginalMixin
Generates a new column based on the kmeans algorithm.
Transformer runs the kmeans algorithm with the given number of clusters and then identifies the bin cuts from the results. Finally, it passes them into a cut function.
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> OneDKmeansTransformer(
...     columns="a",
...     n_clusters=2,
...     new_column_name="new",
...     drop_original=False,
...     kmeans_kwargs={"random_state": 42},
... )
OneDKmeansTransformer(columns=['a'], kmeans_kwargs={'random_state': 42},
                      n_clusters=2, new_column_name='new')
```
- FITS = True
- fit(X: FrameT, y: IntoSeriesT | None = None) OneDKmeansTransformer[source]
Fit transformer to input data.
- Parameters:
X (pd/pl.DataFrame) – Dataframe with columns to learn scaling values from.
y (None) – Required for pipeline.
- Returns:
Fitted class instance.
- Return type:
OneDKmeansTransformer
- Raises:
ValueError – if columns in X contain missing values.
Examples
```pycon
>>> import polars as pl
>>> transformer = OneDKmeansTransformer(
...     columns="a",
...     n_clusters=2,
...     new_column_name="new",
...     drop_original=False,
...     kmeans_kwargs={"random_state": 42},
... )
>>> test_df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
>>> transformer.fit(test_df)
OneDKmeansTransformer(columns=['a'], kmeans_kwargs={'random_state': 42},
                      n_clusters=2, new_column_name='new')
```
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> transformer = OneDKmeansTransformer(
...     columns="a",
...     n_clusters=2,
...     new_column_name="kmeans_column",
...     drop_original=False,
...     kmeans_kwargs={"random_state": 42},
... )
>>> transformer.get_feature_names_out()
['kmeans_column']
```
- jsonable = True
- lazyframe_compatible = False
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Serialize the transformer to a JSON-compatible dictionary.
- Returns:
JSON representation of the transformer, including init parameters.
- Return type:
dict[str, dict[str, Any]]
Examples
>>> import polars as pl
>>> x = OneDKmeansTransformer(
...     columns='a',
...     n_clusters=2,
...     new_column_name="new",
...     drop_original=False,
...     kmeans_kwargs={"random_state": 42},
... )
>>> test_df = pl.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
>>> x.fit(test_df)
OneDKmeansTransformer(columns=['a'], kmeans_kwargs={'random_state': 42},
                      n_clusters=2, new_column_name='new')
>>> x.to_json()
{'tubular_version': ..., 'classname': 'OneDKmeansTransformer', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'new', 'n_init': 'auto', 'n_clusters': 2, 'drop_original': False, 'kmeans_kwargs': {'random_state': 42}}, 'fit': {'is_fitted_': True, 'bins': [3, 4]}}
- transform(X: FrameT) FrameT[source]
Bin the input pd/pl.DataFrame (X) using the cuts derived from the Kmeans results and add the resulting column to X.
- Parameters:
X (pl/pd.DataFrame) – Data to transform.
- Returns:
X – Input X with additional cluster column added.
- Return type:
pl/pd.DataFrame
Examples
```pycon
>>> import polars as pl
>>> transformer = OneDKmeansTransformer(
...     columns="a",
...     n_clusters=2,
...     new_column_name="new",
...     drop_original=False,
...     kmeans_kwargs={"random_state": 42},
... )
>>> test_df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
>>> _ = transformer.fit(test_df)
>>> transformer.transform(test_df)
shape: (4, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ new │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 5   ┆ 0   │
│ 2   ┆ 6   ┆ 0   │
│ 3   ┆ 7   ┆ 0   │
│ 4   ┆ 8   ┆ 1   │
└─────┴─────┴─────┘
```
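The step from cluster results to bin labels can be pictured without running k-means at all. The sketch below is an assumption-laden illustration, not the library's code: it takes a hypothetical set of cluster labels, treats the largest value in each cluster as an inclusive upper bin edge (which is consistent with 'bins': [3, 4] in the to_json example for this class), and assigns each value the index of the first edge it does not exceed.

```python
from bisect import bisect_left


def bins_from_clusters(values, labels):
    """Upper edge of each cluster: the largest value assigned to it."""
    upper = {}
    for value, label in zip(values, labels):
        upper[label] = max(upper.get(label, value), value)
    return sorted(upper.values())


def assign_bins(values, bins):
    """Index of the first bin whose upper edge is >= the value."""
    return [bisect_left(bins, value) for value in values]


a = [1, 2, 3, 4]
hypothetical_labels = [0, 0, 0, 1]  # stand-in for a fitted k-means result

bins = bins_from_clusters(a, hypothetical_labels)
new = assign_bins(a, bins)
```

Under these assumptions `bins` and `new` line up with the fitted bins and the `new` column shown in the examples for this class.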
- class tubular.numeric.PCATransformer(**kwargs)[source]
Bases:
BaseNumericTransformer
Generates variables using Principal component analysis (PCA).
Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.
It is based on sklearn class sklearn.decomposition.PCA
- pca
- Type:
PCA class from sklearn.decomposition
- n_components_
The estimated number of components. When n_components is set to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this number is estimated from input data. Otherwise it equals the parameter n_components, or the lesser value of n_features and n_samples if n_components is None.
- Type:
int
- feature_names_out
list of feature names representing the new dimensions.
- Type:
list or None
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = True
- deprecated = True
- fit(X: DataFrame, y: Series | None = None) DataFrame[source]
Fit PCA to input data.
- Parameters:
X (pd.DataFrame) – Dataframe with columns to learn scaling values from.
y (None) – Required for pipeline.
- Returns:
fitted class instance.
- Return type:
- Raises:
ValueError – if n_components is invalid for data
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
- transform(X: DataFrame) DataFrame[source]
Generate PCA features from the input pandas DataFrame (X) and add the resulting column or columns to X.
- Parameters:
X (pd.DataFrame) – Data to transform.
- Returns:
X – Input X with additional column or columns (self.feature_names_out) added. These contain the PCA components computed from the selected columns.
- Return type:
pd.DataFrame
- class tubular.numeric.RatioTransformer(columns: ]], return_dtype: ]] = 'Float32', **kwargs: bool | None)[source]
Bases:
BaseNumericTransformer
Transformer that performs a division operation between two columns.
This transformer allows performing division between two columns in a DataFrame and stores the result in a new column.
- columns
List of exactly two column names to operate on. The first column is the numerator, and the second column is the denominator.
- Type:
ListOfTwoStrs
- return_dtype
The dtype of the resulting column, either ‘Float32’ or ‘Float64’.
- Type:
str
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> transformer = RatioTransformer(columns=["a", "b"], return_dtype="Float32")
>>> transformer.columns
['a', 'b']
>>> transformer.return_dtype
'Float32'
```
- FITS = False
- get_feature_names_out() list[str][source]
Get the names of the output features.
- Returns:
List containing the name of the new column created by the transformation.
- Return type:
list[str]
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Serialize the transformer to a JSON-compatible dictionary.
- Returns:
JSON representation of the transformer, including init parameters.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> ratio_transformer = RatioTransformer(columns=["a", "b"], return_dtype="Float32")
>>> ratio_transformer.to_json()
{'tubular_version': ..., 'classname': 'RatioTransformer', 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False, 'return_native': True, 'return_dtype': 'Float32'}, 'fit': {'is_fitted_': True}}
```
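The dictionary returned by to_json is nested by stage: `init` holds the constructor arguments and `fit` holds fitted state. As a rough sketch, the snippet below checks that a payload of this shape survives a round trip through the standard `json` module. The version string is a placeholder chosen for illustration, not a real tubular release, and nothing here calls tubular itself.

```python
import json

# Payload mirroring the structure shown in the to_json example.
payload = {
    "tubular_version": "0.0.0",  # placeholder value, assumed for illustration
    "classname": "RatioTransformer",
    "init": {
        "columns": ["a", "b"],
        "copy": False,
        "verbose": False,
        "return_native": True,
        "return_dtype": "Float32",
    },
    "fit": {"is_fitted_": True},
}

# Serialise to a JSON string and load it back.
restored = json.loads(json.dumps(payload))

# The "init" level carries everything needed to rebuild the constructor call.
init_kwargs = restored["init"]
print(init_kwargs["columns"], init_kwargs["return_dtype"])
```

Because every value is a plain list, string, or boolean, the restored dictionary compares equal to the original.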
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform the DataFrame by applying the division operation between two columns.
- Parameters:
X (DataFrame) – DataFrame containing the columns to operate on.
- Returns:
Transformed DataFrame with the new column containing the division results.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> transformer = RatioTransformer(columns=["a", "b"], return_dtype="Float32")
>>> test_df = pl.DataFrame({"a": [100, 200, 300], "b": [80, 150, 200]})
>>> transformer.transform(test_df)
shape: (3, 3)
┌─────┬─────┬────────────────┐
│ a   ┆ b   ┆ a_divided_by_b │
│ --- ┆ --- ┆ ---            │
│ i64 ┆ i64 ┆ f32            │
╞═════╪═════╪════════════════╡
│ 100 ┆ 80  ┆ 1.25           │
│ 200 ┆ 150 ┆ 1.333333       │
│ 300 ┆ 200 ┆ 1.5            │
└─────┴─────┴────────────────┘
```
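The division and the naming of the output column can be pictured with a small pure-Python sketch. It only illustrates the behaviour shown in the example above and is not tubular's narwhals-based implementation. The helper name `ratio_column` is hypothetical, and the sketch assumes non-zero denominators.

```python
def ratio_column(rows, numerator, denominator):
    """Hypothetical helper: add '<numerator>_divided_by_<denominator>' to each row."""
    new_name = f"{numerator}_divided_by_{denominator}"
    return [
        # First column is the numerator, second is the denominator.
        {**row, new_name: row[numerator] / row[denominator]}
        for row in rows
    ]


rows = [{"a": 100, "b": 80}, {"a": 200, "b": 150}, {"a": 300, "b": 200}]
result = ratio_column(rows, "a", "b")
print([r["a_divided_by_b"] for r in result])
```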
- class tubular.numeric.ScalingTransformer(**kwargs)[source]
Bases:
BaseNumericTransformer
Transformer to perform scaling of numeric columns.
Transformer can apply min max scaling, max absolute scaling or standardisation (subtract mean and divide by std). The transformer uses the appropriate sklearn.preprocessing scaler.
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = True
- deprecated = True
- fit(X: DataFrame, y: Series | None = None) ScalingTransformer[source]
Fit scaler to input data.
- Parameters:
X (pd.DataFrame) – Dataframe with columns to learn scaling values from.
y (None) – Required for pipeline.
- Returns:
fitted class instance.
- Return type:
ScalingTransformer
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
- scaler_options: ClassVar[dict[str, MinMaxScaler | MaxAbsScaler | StandardScaler]] = {'max_abs': <class 'sklearn.preprocessing._data.MaxAbsScaler'>, 'min_max': <class 'sklearn.preprocessing._data.MinMaxScaler'>, 'standard': <class 'sklearn.preprocessing._data.StandardScaler'>}
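The three entries in scaler_options correspond to three simple formulas. The sketch below writes them out in plain Python as an illustration only. It assumes the default sklearn settings (a 0 to 1 range for min max scaling and the population standard deviation for standardisation) and does not handle constant columns; the function names are hypothetical.

```python
from statistics import fmean, pstdev


def min_max(values):
    """'min_max': rescale so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def max_abs(values):
    """'max_abs': divide by the largest absolute value."""
    biggest = max(abs(v) for v in values)
    return [v / biggest for v in values]


def standard(values):
    """'standard': subtract the mean and divide by the population std."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]


data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(min_max(data))
print(max_abs(data))
print(standard(data))
```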
- class tubular.numeric.TwoColumnOperatorTransformer(**kwargs)[source]
Bases:
DataFrameMethodTransformer, BaseNumericTransformer
Applies a pandas.DataFrame method to two columns (add, sub, mul, div, mod, pow).
Transformer assigns the output of the method to a new column. The method will be applied in the form (column 1) operator (column 2), so order matters (if the method does not commute). It is possible to supply other keyword arguments to the transform method, which will be passed to the pandas.DataFrame method being called.
- pd_method_name
The name of the pandas.DataFrame method to be called.
- Type:
str
- columns
list containing two string items: [column1_name, column2_name]. The first will be operated upon by the chosen pandas method using the second.
- Type:
list
- column2_name
The name of the 2nd column in the operation.
- Type:
str
- new_column_name
The name of the new column that the output is assigned to.
- Type:
str
- pd_method_kwargs
Dictionary of method kwargs to be passed to pandas.DataFrame method.
- Type:
dict
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = False
- deprecated = True
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
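To see why column order matters for this (deprecated) transformer, the sketch below applies the same operation with the two columns swapped. It uses the standard `operator` module on plain lists rather than pandas; the `OPERATORS` mapping and the `apply_two_columns` helper are hypothetical names introduced for illustration.

```python
import operator

# Plain-Python stand-ins for the pandas method names listed above.
OPERATORS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": operator.truediv,
    "mod": operator.mod,
    "pow": operator.pow,
}


def apply_two_columns(col1, col2, method_name):
    """Compute (column 1) operator (column 2) element by element."""
    op = OPERATORS[method_name]
    return [op(x, y) for x, y in zip(col1, col2)]


a = [10, 20, 30]
b = [3, 4, 5]

a_sub_b = apply_two_columns(a, b, "sub")  # a - b
b_sub_a = apply_two_columns(b, a, "sub")  # b - a, a different result
a_add_b = apply_two_columns(a, b, "add")  # addition commutes
b_add_a = apply_two_columns(b, a, "add")
print(a_sub_b, b_sub_a, a_add_b == b_add_a)
```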
tubular.strings module
Contains transformers that apply string functions.
- class tubular.strings.ExtractStringComponentsTransformer(columns: str | ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]], by: str, return_n_components: ]], **kwargs: bool | None)[source]
Bases:
BaseTransformer
Transformer class to extract components from string columns, split by given character.
- by
character to split on
- Type:
str
- return_n_components
number of components to return
- Type:
int
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> from pprint import pprint
>>> transformer = ExtractStringComponentsTransformer(
...     columns=["a"], by="@", return_n_components=2
... )
>>> transformer
ExtractStringComponentsTransformer(by='@', columns=['a'], return_n_components=2)

>>> json_dump = transformer.to_json()
>>> pprint(json_dump)
{'classname': 'ExtractStringComponentsTransformer',
 'fit': {'is_fitted_': False},
 'init': {'by': '@',
          'columns': ['a'],
          'copy': False,
          'return_n_components': 2,
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}

>>> ExtractStringComponentsTransformer.from_json(json_dump)
ExtractStringComponentsTransformer(by='@', columns=['a'], return_n_components=2)
```
- FITS = False
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> transformer = ExtractStringComponentsTransformer(
...     columns=["a"], by="@", return_n_components=2
... )

>>> transformer.get_feature_names_out()
['a_split_by_@_entry_0', 'a_split_by_@_entry_1']
```
- get_transform_exprs() list[Expr][source]
Get transform expressions.
- Returns:
transform expressions for class
- Return type:
list[nw.Expr]
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> from pprint import pprint
>>> transformer = ExtractStringComponentsTransformer(
...     columns=["a"], by="@", return_n_components=2
... )

>>> pprint(transformer.to_json())
{'classname': 'ExtractStringComponentsTransformer',
 'fit': {'is_fitted_': False},
 'init': {'by': '@',
          'columns': ['a'],
          'copy': False,
          'return_n_components': 2,
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Extract components from string columns, split by given character.
- Parameters:
X (DataFrame) – Data containing columns to extract components from.
- Returns:
X – Transformed input X with string components extracted from columns.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> test_df = pl.DataFrame({"a": ["greg@gmail.com", "bob@apple.net"]})
>>> transformer = ExtractStringComponentsTransformer(
...     columns=["a"], by="@", return_n_components=2
... )
>>> transformer.transform(test_df)
shape: (2, 3)
┌────────────────┬──────────────────────┬──────────────────────┐
│ a              ┆ a_split_by_@_entry_0 ┆ a_split_by_@_entry_1 │
│ ---            ┆ ---                  ┆ ---                  │
│ str            ┆ str                  ┆ str                  │
╞════════════════╪══════════════════════╪══════════════════════╡
│ greg@gmail.com ┆ greg                 ┆ gmail.com            │
│ bob@apple.net  ┆ bob                  ┆ apple.net            │
└────────────────┴──────────────────────┴──────────────────────┘
```
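The split and the output column names can be mimicked in plain Python. This is a sketch of the documented behaviour rather than the library's code: the helper `extract_components` is hypothetical, and filling missing components with None is an assumption that the documentation does not confirm.

```python
def extract_components(values, column, by, return_n_components):
    """Hypothetical helper: split each value on `by` and keep the first n parts."""
    # Column names follow the pattern shown by get_feature_names_out.
    names = [
        f"{column}_split_by_{by}_entry_{i}" for i in range(return_n_components)
    ]
    out = {name: [] for name in names}
    for value in values:
        parts = value.split(by)
        for i, name in enumerate(names):
            # Assumption: a value with fewer parts than requested yields None.
            out[name].append(parts[i] if i < len(parts) else None)
    return out


emails = ["greg@gmail.com", "bob@apple.net"]
components = extract_components(emails, "a", "@", 2)
for name, column_values in components.items():
    print(name, column_values)
```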
- class tubular.strings.LowerCaseTransformer(columns: str | ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]], **kwargs: bool | None)[source]
Bases:
BaseTransformer
Transformer class to lowercase text columns.
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> from pprint import pprint
>>> transformer = LowerCaseTransformer(
...     columns=["a"],
... )
>>> transformer
LowerCaseTransformer(columns=['a'])

>>> json_dump = transformer.to_json()
>>> pprint(json_dump)
{'classname': 'LowerCaseTransformer',
 'fit': {'is_fitted_': False},
 'init': {'columns': ['a'],
          'copy': False,
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}

>>> LowerCaseTransformer.from_json(json_dump)
LowerCaseTransformer(columns=['a'])
```
- FITS = False
- get_transform_exprs() list[Expr][source]
Get transform expressions.
- Returns:
transform expressions for class
- Return type:
list[nw.Expr]
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Lowercase text in the given columns.
- Parameters:
X (DataFrame) – Data containing columns to lowercase.
- Returns:
X – Transformed input X with text lowercased in given columns.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> test_df = pl.DataFrame({"a": ["HeLlO", None, " HI"]})
>>> transformer = LowerCaseTransformer(columns="a")
>>> transformer.transform(test_df)
shape: (3, 1)
┌───────┐
│ a     │
│ ---   │
│ str   │
╞═══════╡
│ hello │
│ null  │
│  hi   │
└───────┘
```
- class tubular.strings.RemoveCharactersTransformer(columns: str | ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]], characters: list[str], **kwargs: bool | None)[source]
Bases:
BaseTransformer
Transformer class to remove characters from text columns.
- characters
list of characters to remove from text columns.
- Type:
list[str]
- characters_formatted
characters attr formatted into regex string.
- Type:
str
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> from pprint import pprint
>>> transformer = RemoveCharactersTransformer(columns=["a"], characters=["\\d"])
>>> transformer
RemoveCharactersTransformer(characters=['\\d'], columns=['a'])

>>> json_dump = transformer.to_json()
>>> pprint(json_dump)
{'classname': 'RemoveCharactersTransformer',
 'fit': {'is_fitted_': False},
 'init': {'characters': ['\\d'],
          'columns': ['a'],
          'copy': False,
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}

>>> RemoveCharactersTransformer.from_json(json_dump)
RemoveCharactersTransformer(characters=['\\d'], columns=['a'])
```
- FITS = False
- get_transform_exprs() list[Expr][source]
Get transform expressions.
- Returns:
transform expressions for class
- Return type:
list[nw.Expr]
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> from pprint import pprint
>>> transformer = RemoveCharactersTransformer(columns=["a", "b"], characters=["a"])

>>> pprint(transformer.to_json())
{'classname': 'RemoveCharactersTransformer',
 'fit': {'is_fitted_': False},
 'init': {'characters': ['a'],
          'columns': ['a', 'b'],
          'copy': False,
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Strip unwanted characters from specified columns.
- Parameters:
X (DataFrame) – Data containing columns to strip.
- Returns:
X – Transformed input X with characters stripped from specified columns.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> test_df = pl.DataFrame({"a": [" 8hi!", None, "9999hello "]})
>>> transformer = RemoveCharactersTransformer(columns=["a"], characters=["\\W", "\\s"])
>>> transformer.transform(test_df)
shape: (3, 1)
┌───────────┐
│ a         │
│ ---       │
│ str       │
╞═══════════╡
│ 8hi       │
│ null      │
│ 9999hello │
└───────────┘
```
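The characters_formatted attribute is described as the characters list formatted into a regex string. One plausible way to picture this, using the standard `re` module, is to join the patterns with `|` and substitute matches with an empty string. The joining rule is an assumption made for this sketch, as is the helper name `remove_characters`; the library's actual formatting may differ.

```python
import re


def remove_characters(values, characters):
    """Hypothetical helper: strip every match of the given patterns."""
    # Assumption: the patterns are combined with "|" into a single regex.
    pattern = re.compile("|".join(characters))
    # Nulls are passed through untouched.
    return [None if v is None else pattern.sub("", v) for v in values]


# Non-word characters and whitespace removed, as in the transform example.
cleaned = remove_characters([" 8hi!", None, "9999hello "], [r"\W", r"\s"])
# Digits removed, as in the class-level example.
digits_removed = remove_characters(["a1b2"], [r"\d"])
print(cleaned, digits_removed)
```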
- class tubular.strings.SeriesStrMethodTransformer(**kwargs)[source]
Bases:
BaseTransformer
Transformer that applies a pandas.Series.str method.
Transformer assigns the output of the method to a new column. It is possible to supply other keyword arguments to the transform method, which will be passed to the pandas.Series.str method being called.
Be aware it is possible to supply incompatible arguments to init that will only be identified when transform is run. This is because there are many combinations of method, input and output sizes. Additionally some methods may only work as expected when called in transform with specific keyword arguments.
- new_column_name
The name of the column or columns that the output of the pd.Series.str method is assigned to in transform.
- Type:
str
- pd_method_name
The name of the pd.Series.str method to call.
- Type:
str
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- deprecated = True
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
- transform(X: DataFrame) DataFrame[source]
Apply given pandas.Series.str method to given column.
Any keyword arguments set in the pd_method_kwargs attribute are passed onto the pd.Series.str method when calling it.
- Parameters:
X (pd.DataFrame) – Data to transform.
- Returns:
X – Input X with additional column (self.new_column_name) added. This contains the output of running the pd.Series.str method.
- Return type:
pd.DataFrame
- class tubular.strings.StringConcatenator(**kwargs)[source]
Bases:
BaseTransformer
Transformer to combine data from specified columns, of mixed datatypes, into a new column containing one string.
- Parameters:
columns (str or list of str) – Columns to concatenate.
new_column_name (str, default = "new_column") – New column name
separator (str, default = " ") – Separator for the new string value
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- deprecated = True
- jsonable = False
- lazyframe_compatible = False
- polars_compatible = False
- class tubular.strings.StringContainsTransformer(columns: str | ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]], reference: str, reference_as_column: bool = False, **kwargs: bool | None)[source]
Bases:
BaseTransformer
Transformer class to indicate if given columns contain reference values.
- reference
column or value to compare against, e.g. look for values of reference=’a’ in columns [‘b’, ‘c’].
- Type:
str
- reference_as_column
indicates whether reference represents a column (or value). Note, reference_as_column=True is not supported for pandas backend.
- Type:
bool
- characters_formatted
characters attr formatted into regex string.
- Type:
str
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> from pprint import pprint
>>> transformer = StringContainsTransformer(
...     columns=["a"], reference="b", reference_as_column=True
... )
>>> transformer
StringContainsTransformer(columns=['a'], reference='b',
                          reference_as_column=True)

>>> json_dump = transformer.to_json()
>>> pprint(json_dump)
{'classname': 'StringContainsTransformer',
 'fit': {'is_fitted_': False},
 'init': {'columns': ['a'],
          'copy': False,
          'reference': 'b',
          'reference_as_column': True,
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}

>>> StringContainsTransformer.from_json(json_dump)
StringContainsTransformer(columns=['a'], reference='b',
                          reference_as_column=True)
```
- FITS = False
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> transformer = StringContainsTransformer(columns=["a", "b"], reference="c")

>>> transformer.get_feature_names_out()
['a_contains_c', 'b_contains_c']
```
- get_transform_exprs() list[Expr][source]
Get transform expressions.
- Returns:
transform expressions for class
- Return type:
list[nw.Expr]
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> from pprint import pprint
>>> transformer = StringContainsTransformer(
...     columns=["a"], reference="b", reference_as_column=True
... )

>>> pprint(transformer.to_json())
{'classname': 'StringContainsTransformer',
 'fit': {'is_fitted_': False},
 'init': {'columns': ['a'],
          'copy': False,
          'reference': 'b',
          'reference_as_column': True,
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Indicate if provided columns contain reference values.
- Parameters:
X (DataFrame) – Data containing columns to check for reference values.
- Returns:
X – Transformed input X with boolean columns added, indicating whether the specified columns contain the reference values.
- Return type:
DataFrame
- Raises:
TypeError – if called on a pandas DataFrame when reference_as_column=True.
Examples
```pycon
>>> import polars as pl
>>> test_df = pl.DataFrame(
...     {"a": ["cat", "dog", None, "mouse"], "b": ["cat", "rat", None, "mouse"]}
... )
>>> transformer = StringContainsTransformer(
...     columns=["a"], reference="b", reference_as_column=True
... )
>>> transformer.transform(test_df)
shape: (4, 3)
┌───────┬───────┬──────────────┐
│ a     ┆ b     ┆ a_contains_b │
│ ---   ┆ ---   ┆ ---          │
│ str   ┆ str   ┆ bool         │
╞═══════╪═══════╪══════════════╡
│ cat   ┆ cat   ┆ true         │
│ dog   ┆ rat   ┆ false        │
│ null  ┆ null  ┆ null         │
│ mouse ┆ mouse ┆ true         │
└───────┴───────┴──────────────┘
```
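The row-by-row comparison, including the way nulls propagate in the example output, can be sketched in plain Python. The helper `contains` is hypothetical and uses a literal substring test; whether the library treats the reference as a literal or as a pattern is not stated here, so that choice is an assumption.

```python
def contains(values, references):
    """Hypothetical helper: does each value contain its paired reference?"""
    return [
        # A null on either side yields a null result, as in the example output.
        None if value is None or ref is None else ref in value
        for value, ref in zip(values, references)
    ]


a = ["cat", "dog", None, "mouse"]
b = ["cat", "rat", None, "mouse"]

# reference_as_column=True: compare each row of "a" with the same row of "b".
a_contains_b = contains(a, b)
# reference as a fixed value: compare every row with the same literal.
a_contains_literal = contains(a, ["o"] * len(a))
print(a_contains_b, a_contains_literal)
```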
Module contents
Initialise classes exposed by package.
- class tubular.AggregateColumnsOverRowTransformer(columns: str | ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]], aggregations: ]], drop_original: bool = False, **kwargs: bool)[source]
Bases:
BaseAggregationTransformer
Aggregate provided columns over each row.
This transformer aggregates data within specified columns and can optionally drop the original columns post-transformation.
Attributes:
- columns: Union[str, list[str]]
List of column names to apply the aggregation transformations to.
- aggregations: list[str]
List of aggregation methods to apply.
- drop_original: bool, optional
Whether to drop the original columns after transformation. Default is False.
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
Indicates if transformer will work with polars frames
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> AggregateColumnsOverRowTransformer(
...     columns=["a", "b"],
...     aggregations=["min", "max"],
... )
AggregateColumnsOverRowTransformer(aggregations=['min', 'max'],
                                   columns=['a', 'b'])
```
- FITS = False
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> transformer = AggregateColumnsOverRowTransformer(
...     columns=["a", "b"],
...     aggregations=["min", "max"],
... )

>>> transformer.get_feature_names_out()
['a_b_min', 'a_b_max']
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform the dataframe by aggregating provided columns over each row.
- Parameters:
X (DataFrame) – DataFrame to transform by aggregating provided columns over each row
- Returns:
DataFrame – Transformed DataFrame with aggregated columns.
Examples
>>> import polars as pl
>>> transformer = AggregateColumnsOverRowTransformer(
...     columns=["a", "b"],
...     aggregations=["min", "max"],
... )
>>> test_df = pl.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
>>> transformer.transform(test_df)
shape: (2, 5)
┌─────┬─────┬─────┬─────────┬─────────┐
│ a   ┆ b   ┆ c   ┆ a_b_min ┆ a_b_max │
│ --- ┆ --- ┆ --- ┆ ---     ┆ ---     │
│ i64 ┆ i64 ┆ i64 ┆ i64     ┆ i64     │
╞═════╪═════╪═════╪═════════╪═════════╡
│ 1   ┆ 3   ┆ 5   ┆ 1       ┆ 3       │
│ 2   ┆ 4   ┆ 6   ┆ 2       ┆ 4       │
└─────┴─────┴─────┴─────────┴─────────┘
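Row-wise aggregation means each output value is computed from the selected columns of a single row. A minimal pure-Python sketch of that idea follows. The helper `aggregate_over_row` is hypothetical and supports only the two aggregations used in the example; the naming pattern is taken from the get_feature_names_out output.

```python
# Only the aggregations that appear in the example are covered here.
AGGREGATIONS = {"min": min, "max": max}


def aggregate_over_row(rows, columns, aggregations):
    """Hypothetical helper: aggregate the chosen columns within each row."""
    prefix = "_".join(columns)  # e.g. "a_b"
    out = []
    for row in rows:
        values = [row[c] for c in columns]
        new = {f"{prefix}_{agg}": AGGREGATIONS[agg](values) for agg in aggregations}
        # Original columns are kept, matching drop_original=False.
        out.append({**row, **new})
    return out


rows = [{"a": 1, "b": 3, "c": 5}, {"a": 2, "b": 4, "c": 6}]
result = aggregate_over_row(rows, ["a", "b"], ["min", "max"])
print(result)
```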
- class tubular.AggregateRowsOverColumnTransformer(columns: str | ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]], aggregations: ]], key: str, drop_original: bool = False, **kwargs: bool)[source]
Bases:
BaseAggregationTransformer
Aggregation transformer.
Aggregate rows over specified columns, where rows are grouped by provided key column.
Attributes:
- columns: Union[str, list[str]]
List of column names to apply the aggregation transformations to.
- aggregations: list[str]
List of aggregation methods to apply.
- key: str
Column name to group by for aggregation.
- drop_original: bool, optional
Whether to drop the original columns after transformation. Default is False.
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
Indicates if transformer will work with polars frames
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> AggregateRowsOverColumnTransformer(
...     columns="a",
...     aggregations=["min", "max"],
...     key="b",
... )
AggregateRowsOverColumnTransformer(aggregations=['min', 'max'], columns=['a'],
                                   key='b')
```
- FITS = False
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> transformer = AggregateRowsOverColumnTransformer(
...     columns="a",
...     aggregations=["min", "max"],
...     key="b",
... )

>>> transformer.get_feature_names_out()
['a_min', 'a_max']
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, Any][source]
Dump transformer to json dict.
Returns:
- dict[str, Any]:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
Example:
```pycon
>>> transformer = AggregateRowsOverColumnTransformer(
...     columns="a",
...     key="c",
...     aggregations=["min", "max"],
... )
>>> transformer.to_json()  # doctest: +NORMALIZE_WHITESPACE
{'tubular_version': ...,
 'classname': 'AggregateRowsOverColumnTransformer',
 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'aggregations': ['min', 'max'], 'drop_original': False, 'key': 'c'},
 'fit': {'is_fitted_': True}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform the dataframe by aggregating rows over specified columns.
- Parameters:
X (DataFrame) – DataFrame to transform by aggregating specified columns.
- Returns:
Transformed DataFrame with aggregated columns.
- Return type:
DataFrame
- Raises:
ValueError – If the key column is not found in the DataFrame.
Examples
```pycon
>>> import polars as pl

>>> transformer = AggregateRowsOverColumnTransformer(
...     columns="a",
...     aggregations=["min", "max"],
...     key="b",
... )

>>> test_df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 1, 2], "c": [1, 2, 3]})

>>> transformer.transform(test_df)
shape: (3, 5)
┌─────┬─────┬─────┬───────┬───────┐
│ a   ┆ b   ┆ c   ┆ a_min ┆ a_max │
│ --- ┆ --- ┆ --- ┆ ---   ┆ ---   │
│ i64 ┆ i64 ┆ i64 ┆ i64   ┆ i64   │
╞═════╪═════╪═════╪═══════╪═══════╡
│ 1   ┆ 1   ┆ 1   ┆ 1     ┆ 2     │
│ 2   ┆ 1   ┆ 2   ┆ 1     ┆ 2     │
│ 3   ┆ 2   ┆ 3   ┆ 3     ┆ 3     │
└─────┴─────┴─────┴───────┴───────┘
```
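In the example output the row count is unchanged: every row receives the aggregate of its own key group. The sketch below reproduces that grouping in plain Python as an illustration only. The helper `aggregate_over_key` is hypothetical and handles just the min and max aggregations from the example.

```python
from collections import defaultdict


def aggregate_over_key(rows, column, key, aggregations):
    """Hypothetical helper: attach per-group aggregates to every row."""
    funcs = {"min": min, "max": max}
    # Collect the values of `column` for each distinct value of `key`.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[column])
    # Broadcast each group's aggregate back onto the rows of that group.
    return [
        {
            **row,
            **{f"{column}_{agg}": funcs[agg](groups[row[key]]) for agg in aggregations},
        }
        for row in rows
    ]


rows = [
    {"a": 1, "b": 1, "c": 1},
    {"a": 2, "b": 1, "c": 2},
    {"a": 3, "b": 2, "c": 3},
]
result = aggregate_over_key(rows, "a", "b", ["min", "max"])
print([(r["a_min"], r["a_max"]) for r in result])
```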
- class tubular.ArbitraryImputer(impute_value: int | float | str | bool, columns: str | list[str], **kwargs: bool | None)[source]
Bases:
BaseImputer
Transformer to impute null values with an arbitrary pre-defined value.
- impute_value
Value to impute nulls with.
- Type:
int or float or str or bool
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> arbitrary_imputer = ArbitraryImputer(columns=["a", "b"], impute_value=5)
>>> arbitrary_imputer
ArbitraryImputer(columns=['a', 'b'], impute_value=5)

>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = arbitrary_imputer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'ArbitraryImputer', 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False, 'return_native': True, 'impute_value': 5}, 'fit': {'is_fitted_': True, 'impute_values_': {'a': 5, 'b': 5}}}

>>> ArbitraryImputer.from_json(json_dump)
ArbitraryImputer(columns=['a', 'b'], impute_value=5)
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Impute missing values with the supplied impute_value.
- Parameters:
X (DataFrame) – Data containing columns to impute.
- Returns:
X (DataFrame) – Transformed input X with nulls imputed with the specified impute_value, for the specified columns.
Examples
>>> import polars as pl
>>> test_df = pl.DataFrame({"a": [1, None, 2], "b": [3, None, 4]})
>>> imputer = ArbitraryImputer(columns=["a", "b"], impute_value=5)
>>> imputer.transform(test_df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 5   ┆ 5   │
│ 2   ┆ 4   │
└─────┴─────┘
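The imputation rule is simple enough to state in a few lines of plain Python: nulls in the selected columns are replaced by the supplied value and everything else is left alone. The helper `impute` below is a hypothetical illustration, not the library's implementation.

```python
def impute(rows, columns, impute_value):
    """Hypothetical helper: replace None with impute_value in the given columns."""
    return [
        {
            name: (impute_value if name in columns and value is None else value)
            for name, value in row.items()
        }
        for row in rows
    ]


rows = [{"a": 1, "b": 3}, {"a": None, "b": None}, {"a": 2, "b": 4}]

imputed = impute(rows, ["a", "b"], 5)  # both columns imputed
only_a = impute(rows, ["a"], 5)        # column "b" is left untouched
print(imputed)
print(only_a)
```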
- class tubular.BetweenDatesTransformer(columns: ]], new_column_name: str, drop_original: bool = False, lower_inclusive: bool = True, upper_inclusive: bool = True, **kwargs: bool)[source]
Bases:
BaseGenericDateTransformer
Transformer to generate a boolean column indicating if one date is between two others.
If any row has column_lower greater than column_upper, the output column for that row will be null instead of raising a warning.
Attributes:
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- column_lower: str
Name of date column to subtract. This attribute is not for use in any method, use ‘columns’ instead. Here only as a fix to allow string representation of transformer.
- column_upper: str
Name of date column to subtract from. This attribute is not for use in any method, use ‘columns’ instead. Here only as a fix to allow string representation of transformer.
- column_between: str
Name of column to check if its values fall between column_lower and column_upper. This attribute is not for use in any method, use ‘columns’ instead. Here only as a fix to allow string representation of transformer.
- columns: list
Contains the names of the columns to compare in the order [column_lower, column_between, column_upper].
- new_column_name: str
new_column_name argument passed when initialising the transformer.
- lower_inclusive: bool
lower_inclusive argument passed when initialising the transformer.
- upper_inclusive: bool
upper_inclusive argument passed when initialising the transformer.
- drop_original: bool
indicates whether to drop original columns.
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> BetweenDatesTransformer(
...     columns=["a", "b", "c"],
...     new_column_name="b_between_a_c",
...     lower_inclusive=True,
...     upper_inclusive=True,
... )
BetweenDatesTransformer(columns=['a', 'b', 'c'],
                        new_column_name='b_between_a_c')
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> transformer = BetweenDatesTransformer(
...     columns=["a", "b", "c"],
...     new_column_name="b_between_a_c",
...     lower_inclusive=True,
...     upper_inclusive=False,
... )
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'BetweenDatesTransformer', 'init': {'columns': ['a', 'b', 'c'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'b_between_a_c', 'drop_original': False, 'lower_inclusive': True, 'upper_inclusive': False}, 'fit': {'is_fitted_': True}}
```
- transform(X: FrameT) FrameT[source]
Transform - creates column indicating if middle date is between the other two.
Rows where the lower bound is greater than the upper bound will produce null in the resulting output column for that row.
- Parameters:
X (pd/pl/nw.DataFrame) – Data to transform.
- Returns:
X (pd/pl/nw.DataFrame) – Input X with additional column (self.new_column_name) added. This column is boolean and indicates if the middle column is between the other 2.
Example
-------
>>> import polars as pl
>>> import datetime
>>> transformer = BetweenDatesTransformer(
...     columns=["a", "b", "c"],
...     new_column_name="b_between_a_c",
...     lower_inclusive=True,
...     upper_inclusive=True,
... )
>>> test_df = pl.DataFrame(
...     {
...         "a": [
...             datetime.date(1990, 9, 27),
...             datetime.date(2005, 10, 7),
...             datetime.date(2010, 1, 1),
...         ],
...         "b": [
...             datetime.date(1991, 5, 22),
...             datetime.date(2001, 12, 10),
...             datetime.date(2009, 1, 1),
...         ],
...         "c": [
...             datetime.date(1993, 4, 20),
...             datetime.date(2007, 11, 8),
...             datetime.date(2008, 1, 1),
...         ],
...     },
... )
>>> transformer.transform(test_df)
shape: (3, 4)
┌────────────┬────────────┬────────────┬───────────────┐
│ a          ┆ b          ┆ c          ┆ b_between_a_c │
│ ---        ┆ ---        ┆ ---        ┆ ---           │
│ date       ┆ date       ┆ date       ┆ bool          │
╞════════════╪════════════╪════════════╪═══════════════╡
│ 1990-09-27 ┆ 1991-05-22 ┆ 1993-04-20 ┆ true          │
│ 2005-10-07 ┆ 2001-12-10 ┆ 2007-11-08 ┆ false         │
│ 2010-01-01 ┆ 2009-01-01 ┆ 2008-01-01 ┆ null          │
└────────────┴────────────┴────────────┴───────────────┘
- class tubular.CappingTransformer(capping_values: dict[str, ~typing.Annotated[list[int | float | None], beartype.vale.Is[lambda list_arg: ...]]] | None = None, quantiles: dict[str, ~typing.Annotated[list[int | float | None], beartype.vale.Is[lambda list_arg: ...]]] | None = None, weights_column: str | None = None, **kwargs: bool)[source]
Bases:
BaseCappingTransformer
Transformer to cap numeric values at a minimum value, a maximum value, or both.
For max capping any values above the cap value will be set to the cap. Similarly for min capping any values below the cap will be set to the cap. Only works for numeric columns.
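The capping rule can be sketched in plain Python. This is an illustration rather than the library's implementation: `cap_values` is a hypothetical helper, and treating `None` as "no cap on that side" is an assumption based on the `list[int | float | None]` type in the signature. The input data and capping values are the ones used in the class example.

```python
from typing import Optional, Sequence, Union

Number = Union[int, float]


def cap_values(
    values: Sequence[Number],
    caps: Sequence[Optional[Number]],
) -> list:
    """Hypothetical helper: cap values at [min_cap, max_cap]."""
    min_cap, max_cap = caps
    capped = []
    for value in values:
        # Assumption: None means that side is left uncapped.
        if min_cap is not None and value < min_cap:
            value = min_cap
        if max_cap is not None and value > max_cap:
            value = max_cap
        capped.append(value)
    return capped


capping_values = {"a": [10, 20], "b": [1, 3]}
data = {"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]}
capped = {
    column: cap_values(values, capping_values[column])
    if column in capping_values
    else list(values)
    for column, values in data.items()
}
print(capped)
```

Column `c` has no entry in `capping_values`, so the sketch passes it through unchanged.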
Attributes:
- capping_values: dict[str, CappingValues] or None
Capping values to apply to each column, capping_values argument.
- quantiles: dict[str, CappingValues] or None
Quantiles to set capping values at from input data. Will be empty after init, values populated when fit is run.
- quantile_capping_values: dict[str, CappingValues] or None
Capping values learned from quantiles (if provided) to apply to each column.
- weights_column: str or None
weights_column argument.
- _replacement_values: dict[str, CappingValues]
Replacement values when capping is applied. Will be a copy of capping_values.
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> import polars as pl
>>> transformer = CappingTransformer(
...     capping_values={"a": [10, 20], "b": [1, 3]},
... )
>>> test_df = pl.DataFrame({"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]})
>>> transformer.transform(test_df)
shape: (4, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 10  ┆ 3   ┆ 1   │
│ 15  ┆ 2   ┆ 2   │
│ 18  ┆ 3   ┆ 3   │
│ 20  ┆ 1   ┆ 4   │
└─────┴─────┴─────┘
>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = transformer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'CappingTransformer', 'init': {'copy': False, 'verbose': False, 'return_native': True, 'capping_values': {'a': [10, 20], 'b': [1, 3]}, 'quantiles': None, 'weights_column': None}, 'fit': {'is_fitted_': False}}
>>> CappingTransformer.from_json(json_dump)
CappingTransformer(capping_values={'a': [10, 20], 'b': [1, 3]})
```
- FITS = True
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame | None = None) CappingTransformer[source]
Learn capping values from input data X.
Calculates the quantiles to cap at given the quantiles dictionary supplied when initialising the transformer. Saves learnt values in the capping_values attribute.
- Parameters:
X (DataFrame) – A dataframe with required columns to be capped.
y (None) – Required for pipeline.
- Returns:
fitted instance of class
- Return type:
CappingTransformer
Example
```pycon
>>> import polars as pl
>>> transformer = CappingTransformer(
...     quantiles={"a": [0.01, 0.99], "b": [0.05, 0.95]},
... )
>>> test_df = pl.DataFrame({"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]})
>>> transformer.fit(test_df)
CappingTransformer(quantiles={'a': [0.01, 0.99], 'b': [0.05, 0.95]})
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- class tubular.ColumnDtypeSetter(columns: str | ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]], dtype: ]], **kwargs: bool)[source]
Bases:
BaseTransformer
Transformer to set the specified columns in a dataframe to a given dtype.
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
- deprecated
indicates if class has been deprecated
- Type:
bool
- FITS = False
- deprecated = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> from pprint import pprint
>>> transformer = ColumnDtypeSetter(columns="a", dtype="Float32")
>>> pprint(transformer.to_json(), sort_dicts=True)
{'classname': 'ColumnDtypeSetter',
 'fit': {'is_fitted_': True},
 'init': {'columns': ['a'],
          'copy': False,
          'dtype': 'Float32',
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform data.
- Parameters:
X (DataFrame) – data to transform.
- Returns:
transformed data
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> df = pl.DataFrame({"a": [1, 2]})
>>> transformer = ColumnDtypeSetter(columns="a", dtype="Float32")
>>> transformer.transform(df)
shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ f32 │
╞═════╡
│ 1.0 │
│ 2.0 │
└─────┘
```
- class tubular.CompareTwoColumnsTransformer(columns: ]], condition: ]], **kwargs: bool | None)[source]
Bases:
BaseTransformer
Transformer to compare two columns and generate outcomes based on conditions.
This transformer evaluates a condition between two columns and generates an outcome based on the result.
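The idea of dispatching a condition to a comparison function can be sketched with the standard `operator` module. This is a plain-Python illustration, not the library's code: `compare_columns` is a hypothetical helper, and only the `">"` symbol is confirmed by the examples in this section; the other symbols in the mapping are assumptions.

```python
import operator

# Only ">" appears in the documented examples; the other symbols are assumed.
OPS = {
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
    "!=": operator.ne,
}


def compare_columns(data, columns, condition):
    """Hypothetical helper: add a boolean column named '<left><condition><right>'."""
    left, right = columns
    op = OPS[condition]
    result = dict(data)
    # Apply the comparison row by row across the two columns.
    result[f"{left}{condition}{right}"] = [
        op(x, y) for x, y in zip(data[left], data[right])
    ]
    return result


out = compare_columns({"a": [1, 2, 3], "b": [3, 2, 1]}, ["a", "b"], ">")
print(out["a>b"])
```

The output column name `a>b` follows the naming shown in the documented example output.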
- polars_compatible
Indicates whether transformer has been converted to polars/pandas agnostic narwhals framework.
- Type:
bool
- FITS
Indicates whether transform requires fit to be run first.
- Type:
bool
- jsonable
Indicates if transformer supports to/from_json methods.
- Type:
bool
- lazyframe_compatible
Indicates whether transformer works with lazyframes.
- Type:
bool
Examples
```pycon
>>> import polars as pl
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]})
>>> transformer = CompareTwoColumnsTransformer(
...     columns=["a", "b"],
...     condition=">",
... )
>>> transformed_df = transformer.transform(df)
>>> print(transformed_df)
shape: (3, 3)
┌─────┬─────┬───────┐
│ a   ┆ b   ┆ a>b   │
│ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ bool  │
╞═════╪═════╪═══════╡
│ 1   ┆ 3   ┆ false │
│ 2   ┆ 2   ┆ false │
│ 3   ┆ 1   ┆ true  │
└─────┴─────┴───────┘
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- ops_map: ClassVar[dict[ConditionEnum, Any]] = {ConditionEnum.EQUAL_TO: <built-in function eq>, ConditionEnum.GREATER_THAN: <built-in function gt>, ConditionEnum.LESS_THAN: <built-in function lt>, ConditionEnum.NOT_EQUAL_TO: <built-in function ne>}
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Serialize the transformer to a JSON-compatible dictionary.
- Returns:
JSON representation of the transformer, including init parameters.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> transformer = CompareTwoColumnsTransformer(
...     columns=["a", "b"],
...     condition=ConditionEnum.GREATER_THAN.value,
... )
>>> json_dict = transformer.to_json()
>>> from pprint import pprint
>>> pprint(json_dict, sort_dicts=True)
{'classname': 'CompareTwoColumnsTransformer',
 'fit': {'is_fitted_': True},
 'init': {'columns': ['a', 'b'],
          'condition': '>',
          'copy': False,
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform two columns based on a condition to generate an outcome.
- Parameters:
X (DataFrame) – DataFrame containing the columns to be transformed.
- Returns:
Transformed DataFrame with the new outcome column.
- Return type:
DataFrame
- Raises:
TypeError – If the columns are not of a numeric type.
Examples
```pycon
>>> import polars as pl
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]})
>>> transformer = CompareTwoColumnsTransformer(
...     columns=["a", "b"],
...     condition=">",
... )
>>> transformed_df = transformer.transform(df)
>>> print(transformed_df)
shape: (3, 3)
┌─────┬─────┬───────┐
│ a   ┆ b   ┆ a>b   │
│ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ bool  │
╞═════╪═════╪═══════╡
│ 1   ┆ 3   ┆ false │
│ 2   ┆ 2   ┆ false │
│ 3   ┆ 1   ┆ true  │
└─────┴─────┴───────┘
```
- class tubular.DateDifferenceTransformer(columns: ]], new_column_name: str, units: ]] = 'D', drop_original: bool = False, custom_days_divider: int | None = None, **kwargs: bool)[source]
Bases:
BaseGenericDateTransformer
Class to calculate the difference between 2 date fields in specified units.
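As a rough illustration of the `common_year` unit, the sketch below divides a day difference by 365. Both the 365-day divisor and the direction (second column minus first column) are inferred from the numbers printed in the transform example for this class, not taken from the library source; `difference_in_common_years` is a hypothetical helper.

```python
import datetime

# Assumption: a "common year" is 365 days.
DAYS_PER_COMMON_YEAR = 365


def difference_in_common_years(first: datetime.date, second: datetime.date) -> float:
    """Hypothetical helper: (second - first) in days, divided by 365."""
    return (second - first).days / DAYS_PER_COMMON_YEAR


pairs = [
    (datetime.date(1993, 9, 27), datetime.date(1991, 5, 22)),
    (datetime.date(2005, 10, 7), datetime.date(2001, 12, 10)),
]
differences = [round(difference_in_common_years(a, b), 6) for a, b in pairs]
print(differences)
```

Under these assumptions the sketch reproduces the values shown in the documented transform output for the same input dates.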
Attributes:
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> transformer = DateDifferenceTransformer(
...     columns=["a", "b"],
...     new_column_name="bla",
...     units="common_year",
... )
>>> transformer
DateDifferenceTransformer(columns=['a', 'b'], new_column_name='bla',
                          units='common_year')
>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = transformer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'DateDifferenceTransformer', 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'bla', 'drop_original': False, 'units': 'common_year', 'custom_days_divider': None}, 'fit': {'is_fitted_': True}}
>>> DateDifferenceTransformer.from_json(json_dump)
DateDifferenceTransformer(columns=['a', 'b'], new_column_name='bla',
                          units='common_year')
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> transformer = DateDifferenceTransformer(columns=["a", "b"], new_column_name="a_diff_b")
>>> # version will vary for local vs CI, so use ... as generic match
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'DateDifferenceTransformer', 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'a_diff_b', 'drop_original': False, 'units': 'D', 'custom_days_divider': None}, 'fit': {'is_fitted_': True}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Calculate the difference between the given fields in the specified units.
- Parameters:
X (DataFrame) – Data containing self.columns
- Returns:
dataframe with added date difference column
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> import datetime
>>> transformer = DateDifferenceTransformer(
...     columns=["a", "b"],
...     new_column_name="a_b_difference_years",
...     units="common_year",
... )
>>> test_df = pl.DataFrame(
...     {
...         "a": [datetime.date(1993, 9, 27), datetime.date(2005, 10, 7)],
...         "b": [datetime.date(1991, 5, 22), datetime.date(2001, 12, 10)],
...     },
... )
>>> transformer.transform(test_df)
shape: (2, 3)
┌────────────┬────────────┬──────────────────────┐
│ a          ┆ b          ┆ a_b_difference_years │
│ ---        ┆ ---        ┆ ---                  │
│ date       ┆ date       ┆ f64                  │
╞════════════╪════════════╪══════════════════════╡
│ 1993-09-27 ┆ 1991-05-22 ┆ -2.353425            │
│ 2005-10-07 ┆ 2001-12-10 ┆ -3.827397            │
└────────────┴────────────┴──────────────────────┘
```
- class tubular.DatetimeComponentExtractor(columns: str | list[str], include: ]], **kwargs: str | bool)[source]
Bases:
BaseDatetimeTransformer
Transformer to extract numeric datetime components.
Attributes:
- columns: List[str]
List of columns for processing
- include: list of str
Which numeric datetime components to extract
- polars_compatible: bool
Indicates whether transformer has been converted to polars/pandas agnostic framework
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
- jsonable: bool
Indicates if transformer supports to/from_json methods
- FITS: bool
Indicates whether transform requires fit to be run first
Example:
```pycon
>>> transformer = DatetimeComponentExtractor(
...     columns="a",
...     include=["hour", "day"],
... )
>>> transformer
DatetimeComponentExtractor(columns=['a'], include=['hour', 'day'])
>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = transformer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'DatetimeComponentExtractor', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'dummy', 'drop_original': False, 'include': ['hour', 'day']}, 'fit': {'is_fitted_': True}}
>>> DatetimeComponentExtractor.from_json(json_dump)
DatetimeComponentExtractor(columns=['a'], include=['hour', 'day'])
```
- FITS = False
- INCLUDE_OPTIONS: ClassVar[list[str]] = ['hour', 'day', 'month', 'year']
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
List of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> transformer = DatetimeComponentExtractor(
...     columns=["a", "b"],
...     include=["hour", "day"],
... )
>>> transformer.get_feature_names_out()
['a_hour', 'a_day', 'b_hour', 'b_day']
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, Any][source]
Convert transformer to JSON format.
- Returns:
JSON representation of the transformer
- Return type:
dict
Examples
```pycon
>>> transformer = DatetimeComponentExtractor(
...     columns="a",
...     include=["hour", "day"],
... )
>>> transformer.to_json()
{'tubular_version': '...', 'classname': 'DatetimeComponentExtractor', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'dummy', 'drop_original': False, 'include': ['hour', 'day']}, 'fit': {'is_fitted_': True}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform - Extracts numeric datetime components.
- Parameters:
X (DataFrame) – Data with columns to extract info from.
- Returns:
X – Transformed input X with added columns of extracted information.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> import datetime
>>> transformer = DatetimeComponentExtractor(
...     columns="a",
...     include=["hour", "day"],
... )
>>> test_df = pl.DataFrame(
...     {
...         "a": [
...             datetime.datetime(1993, 9, 27, 14, 30),
...             datetime.datetime(2005, 10, 7, 9, 45),
...         ],
...         "b": [
...             datetime.datetime(1991, 5, 22, 18, 0),
...             datetime.datetime(2001, 12, 10, 23, 59),
...         ],
...     },
... )
>>> transformer.transform(test_df)
shape: (2, 4)
┌─────────────────────┬─────────────────────┬────────┬───────┐
│ a                   ┆ b                   ┆ a_hour ┆ a_day │
│ ---                 ┆ ---                 ┆ ---    ┆ ---   │
│ datetime[μs]        ┆ datetime[μs]        ┆ f32    ┆ f32   │
╞═════════════════════╪═════════════════════╪════════╪═══════╡
│ 1993-09-27 14:30:00 ┆ 1991-05-22 18:00:00 ┆ 14.0   ┆ 27.0  │
│ 2005-10-07 09:45:00 ┆ 2001-12-10 23:59:00 ┆ 9.0    ┆ 7.0   │
└─────────────────────┴─────────────────────┴────────┴───────┘
```
- class tubular.DatetimeInfoExtractor(columns: str | list[str], include: ]] | None = None, datetime_mappings: dict[~typing.Annotated[str, beartype.vale.Is[lambda s: ...]], dict[int, str]] | None = None, drop_original: bool | None = False, **kwargs: str | bool)[source]
Bases:
BaseDatetimeTransformer
Transformer to extract various features from datetime variables.
Attributes:
- columns: List[str]
List of columns for processing
- include: list of str, default = ["timeofday", "timeofmonth", "timeofyear", "dayofweek"]
Which datetime categorical information to extract
- datetime_mappings: dict, default = None
Optional argument to define custom mappings for datetime values.
- drop_original: bool
indicates whether to drop provided columns post transform
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> transformer = DatetimeInfoExtractor(
...     columns="a",
...     include="timeofday",
... )
>>> transformer
DatetimeInfoExtractor(columns=['a'], datetime_mappings={},
                      include=['timeofday'])
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'DatetimeInfoExtractor', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'dummy', 'drop_original': False, 'include': ['timeofday'], 'datetime_mappings': {}}, 'fit': {'is_fitted_': True}}
```
- DATETIME_ATTR: ClassVar[dict[str, str]] = {'dayofweek': 'weekday', 'timeofday': 'hour', 'timeofmonth': 'day', 'timeofyear': 'month'}
- DEFAULT_MAPPINGS: ClassVar[dict[str, dict[int, str]]] = {'dayofweek': {1: 'monday', 2: 'tuesday', 3: 'wednesday', 4: 'thursday', 5: 'friday', 6: 'saturday', 7: 'sunday'}, 'timeofday': {0: 'night', 1: 'night', 2: 'night', 3: 'night', 4: 'night', 5: 'night', 6: 'morning', 7: 'morning', 8: 'morning', 9: 'morning', 10: 'morning', 11: 'morning', 12: 'afternoon', 13: 'afternoon', 14: 'afternoon', 15: 'afternoon', 16: 'afternoon', 17: 'afternoon', 18: 'evening', 19: 'evening', 20: 'evening', 21: 'evening', 22: 'evening', 23: 'evening'}, 'timeofmonth': {1: 'start', 2: 'start', 3: 'start', 4: 'start', 5: 'start', 6: 'start', 7: 'start', 8: 'start', 9: 'start', 10: 'start', 11: 'middle', 12: 'middle', 13: 'middle', 14: 'middle', 15: 'middle', 16: 'middle', 17: 'middle', 18: 'middle', 19: 'middle', 20: 'middle', 21: 'end', 22: 'end', 23: 'end', 24: 'end', 25: 'end', 26: 'end', 27: 'end', 28: 'end', 29: 'end', 30: 'end', 31: 'end'}, 'timeofyear': {1: 'winter', 2: 'winter', 3: 'spring', 4: 'spring', 5: 'spring', 6: 'summer', 7: 'summer', 8: 'summer', 9: 'autumn', 10: 'autumn', 11: 'autumn', 12: 'winter'}}
- FITS = False
- INCLUDE_OPTIONS: ClassVar[list[str]] = ['timeofday', 'timeofmonth', 'timeofyear', 'dayofweek']
- RANGE_TO_MAP: ClassVar[dict[str, set[int]]] = {'dayofweek': {1, 2, 3, 4, 5, 6, 7}, 'timeofday': {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23}, 'timeofmonth': {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}, 'timeofyear': {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}}
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> transformer = DatetimeInfoExtractor(
...     columns=["a", "b"],
...     include=["timeofday", "timeofmonth"],
... )
>>> transformer.get_feature_names_out()
['a_timeofday', 'a_timeofmonth', 'b_timeofday', 'b_timeofmonth']
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
>>> transformer = DatetimeInfoExtractor(columns='a')
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'DatetimeInfoExtractor', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'dummy', 'drop_original': False, 'include': ['timeofday', 'timeofmonth', 'timeofyear', 'dayofweek'], 'datetime_mappings': {}}, 'fit': {'is_fitted_': True}}
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform - Extracts new features from datetime variables.
- Parameters:
X (DataFrame) – Data with columns to extract info from.
- Returns:
X (DataFrame) – Transformed input X with added columns of extracted information.
Example
-------
>>> import polars as pl
>>> import datetime
>>> transformer = DatetimeInfoExtractor(
...     columns="a",
...     include="timeofmonth",
... )
>>> test_df = pl.DataFrame(
...     {
...         "a": [datetime.datetime(1993, 9, 27), datetime.datetime(2005, 10, 7)],
...         "b": [datetime.datetime(1991, 5, 22), datetime.datetime(2001, 12, 10)],
...     },
... )
>>> transformer.transform(test_df)
shape: (2, 3)
┌─────────────────────┬─────────────────────┬───────────────┐
│ a                   ┆ b                   ┆ a_timeofmonth │
│ ---                 ┆ ---                 ┆ ---           │
│ datetime[μs]        ┆ datetime[μs]        ┆ enum          │
╞═════════════════════╪═════════════════════╪═══════════════╡
│ 1993-09-27 00:00:00 ┆ 1991-05-22 00:00:00 ┆ end           │
│ 2005-10-07 00:00:00 ┆ 2001-12-10 00:00:00 ┆ start         │
└─────────────────────┴─────────────────────┴───────────────┘
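The lookup behind these categorical features can be sketched in plain Python. The dictionaries below restate the `timeofmonth` and `timeofyear` entries of `DATETIME_ATTR` and `DEFAULT_MAPPINGS` listed for this class; `extract_info` is a hypothetical helper, and the sketch ignores custom `datetime_mappings`, whose merging behaviour is not described here.

```python
import datetime

# Which datetime attribute feeds each feature (subset of DATETIME_ATTR).
DATETIME_ATTR = {"timeofmonth": "day", "timeofyear": "month"}

# Subset of DEFAULT_MAPPINGS, rebuilt with ranges instead of literal dicts.
DEFAULT_MAPPINGS = {
    "timeofmonth": {
        **{day: "start" for day in range(1, 11)},
        **{day: "middle" for day in range(11, 21)},
        **{day: "end" for day in range(21, 32)},
    },
    "timeofyear": {
        **{month: "winter" for month in (12, 1, 2)},
        **{month: "spring" for month in (3, 4, 5)},
        **{month: "summer" for month in (6, 7, 8)},
        **{month: "autumn" for month in (9, 10, 11)},
    },
}


def extract_info(value: datetime.datetime, component: str) -> str:
    """Hypothetical helper: map one datetime to its categorical label."""
    raw = getattr(value, DATETIME_ATTR[component])
    return DEFAULT_MAPPINGS[component][raw]


labels = [
    extract_info(datetime.datetime(1993, 9, 27), "timeofmonth"),
    extract_info(datetime.datetime(2005, 10, 7), "timeofmonth"),
]
print(labels)
```

The two labels correspond to the `a_timeofmonth` values in the documented example output.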
- class tubular.DatetimeSinusoidCalculator(columns: str | list[str], method: ]], units: ]]], period: ]]] = 6.283185307179586, drop_original: bool = False, **kwargs: bool | str)[source]
Bases:
BaseDatetimeTransformer
Calculate the sine or cosine of a datetime column in a given unit (e.g. hour).
Includes the option to scale period of the sine or cosine to match the natural period of the unit (e.g. 24).
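A plain-Python sketch of the period scaling follows. The formula `func(2 * pi * unit_value / period)` is an assumption: it is consistent with the documented default output (with `period = 2 * pi` it reduces to `sin(month)`), but it is not taken from the library source. `sinusoid_feature` is a hypothetical helper.

```python
import datetime
import math


def sinusoid_feature(
    value: datetime.datetime,
    method: str = "sin",
    units: str = "month",
    period: float = 2 * math.pi,
) -> float:
    """Hypothetical helper: apply sin or cos to a period-scaled datetime unit."""
    func = {"sin": math.sin, "cos": math.cos}[method]
    unit_value = getattr(value, units)
    # Assumed scaling: one full cycle every `period` units.
    return func(unit_value * 2 * math.pi / period)


default_period = [
    round(sinusoid_feature(datetime.datetime(1993, 9, 27)), 6),
    round(sinusoid_feature(datetime.datetime(2005, 10, 7)), 6),
]
print(default_period)

# With period=12, months complete one cycle per year; March sits at the peak.
march = sinusoid_feature(datetime.datetime(2020, 3, 1), period=12)
print(round(march, 6))
```

Choosing the natural period of the unit (12 for months, 24 for hours) makes the first and last values of a cycle land next to each other, which is the usual reason for encoding cyclical features this way.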
Attributes:
- columns: str or list
Columns to take the sine or cosine of.
- method: str or list
The function to be calculated; either sin, cos or a list containing both.
- units: str or dict
Which time unit the calculation is to be carried out on. Will take any of ‘year’, ‘month’, ‘day’, ‘hour’, ‘minute’, ‘second’, ‘microsecond’. Can be a string or a dict containing key-value pairs of column name and units to be used for that column.
- period: str, float or dict, default = 2*np.pi
The period of the output in the units specified above. Can be a single value or a dict containing key-value pairs of column name and period to be used for that column.
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> DatetimeSinusoidCalculator(
...     columns="a",
...     method="sin",
...     units="month",
... )
DatetimeSinusoidCalculator(columns=['a'], method=['sin'], units='month')
```
- FITS = False
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> transformer = DatetimeSinusoidCalculator(
...     columns="a",
...     method="sin",
...     units="month",
... )
>>> transformer.get_feature_names_out()
['sin_6.283185307179586_month_a']
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> transformer = DatetimeSinusoidCalculator(
...     columns="a",
...     method="sin",
...     units="month",
... )
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'DatetimeSinusoidCalculator', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'dummy', 'drop_original': False, 'method': ['sin'], 'units': 'month', 'period': 6.283185307179586}, 'fit': {'is_fitted_': True}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, return_native_override: bool | None = None) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform - creates column containing sine or cosine of another datetime column.
Which function is used is stored in the self.method attribute.
- Parameters:
X (pd/pl/nw.DataFrame) – Data to transform.
return_native_override (Optional[bool]) – Option to override return_native attr in transformer, useful when calling parent methods
- Returns:
X (pd/pl/nw.DataFrame) – Input X with additional columns added. Going by the example output, these are named “<method>_<period>_<units>_<original_column>”.
Example
-------
>>> import polars as pl
>>> import datetime
>>> transformer = DatetimeSinusoidCalculator(
...     columns="a",
...     method="sin",
...     units="month",
... )
>>> test_df = pl.DataFrame(
...     {
...         "a": [datetime.datetime(1993, 9, 27), datetime.datetime(2005, 10, 7)],
...         "b": [datetime.datetime(1991, 5, 22), datetime.datetime(2001, 12, 10)],
...     },
... )
>>> transformer.transform(test_df)
shape: (2, 3)
┌─────────────────────┬─────────────────────┬───────────────────────────────┐
│ a                   ┆ b                   ┆ sin_6.283185307179586_month_a │
│ ---                 ┆ ---                 ┆ ---                           │
│ datetime[μs]        ┆ datetime[μs]        ┆ f64                           │
╞═════════════════════╪═════════════════════╪═══════════════════════════════╡
│ 1993-09-27 00:00:00 ┆ 1991-05-22 00:00:00 ┆ 0.412118                      │
│ 2005-10-07 00:00:00 ┆ 2001-12-10 00:00:00 ┆ -0.544021                     │
└─────────────────────┴─────────────────────┴───────────────────────────────┘
- class tubular.DifferenceTransformer(columns: ]], **kwargs: bool | None)[source]
Bases:
BaseNumericTransformer
Transformer that performs a subtraction operation between two columns.
This transformer allows performing subtraction between two columns in a DataFrame and stores the result in a new column.
- columns
List of exactly two column names to operate on. The second column is subtracted from the first.
- Type:
ListOfTwoStrs
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> transformer = DifferenceTransformer(columns=["a", "b"])
>>> transformer.columns
['a', 'b']
```
- FITS = False
- get_feature_names_out() list[str][source]
Get the names of the output features.
- Returns:
List containing the name of the new column created by the transformation.
- Return type:
list[str]
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform the DataFrame by applying the subtraction operation between two columns.
- Parameters:
X (DataFrame) – DataFrame containing the columns to operate on.
- Returns:
Transformed DataFrame with the new column containing the subtraction results.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> transformer = DifferenceTransformer(columns=["a", "b"])
>>> test_df = pl.DataFrame({"a": [100, 200, 300], "b": [80, 150, 200]})
>>> transformer.transform(test_df)
shape: (3, 3)
┌─────┬─────┬───────────┐
│ a   ┆ b   ┆ a_minus_b │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ i64       │
╞═════╪═════╪═══════════╡
│ 100 ┆ 80  ┆ 20        │
│ 200 ┆ 150 ┆ 50        │
│ 300 ┆ 200 ┆ 100       │
└─────┴─────┴───────────┘
```
- class tubular.GroupRareLevelsTransformer(columns: str | ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]] | None = None, cut_off_percent: ]] = 0.01, weights_column: str | None = None, rare_level_name: str | ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]] = 'rare', record_rare_levels: bool = True, unseen_levels_to_rare: bool = True, **kwargs: bool)[source]
Bases:
BaseTransformer, WeightColumnMixin
Group together rare levels of nominal variables into a new rare level.
Rare levels are defined by a cut off percentage, which can either be based on the number of rows or sum of weights. Any levels below this cut off value will be grouped into the rare level.
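The cut-off rule can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `group_rare_levels` is a hypothetical helper, and treating a level whose share exactly equals the cut-off as non-rare is an assumption (the text above only says levels below the cut-off are grouped).

```python
from collections import Counter


def group_rare_levels(values, cut_off_percent=0.01, rare_level_name="rare", weights=None):
    """Hypothetical helper: replace levels whose share is below the cut-off."""
    if weights is None:
        # Without weights, every row counts equally.
        weights = [1] * len(values)
    totals = Counter()
    for level, weight in zip(values, weights):
        totals[level] += weight
    total_weight = sum(totals.values())
    # Assumption: share == cut_off_percent counts as non-rare.
    non_rare = {
        level for level, weight in totals.items()
        if weight / total_weight >= cut_off_percent
    }
    grouped = [v if v in non_rare else rare_level_name for v in values]
    return grouped, non_rare


values = ["a"] * 6 + ["b"] * 3 + ["c"]
grouped, non_rare = group_rare_levels(values, cut_off_percent=0.2)
print(grouped, sorted(non_rare))

# With weights, the share is computed on the sum of weights instead of row counts.
weighted, weighted_non_rare = group_rare_levels(
    values, cut_off_percent=0.2, weights=[1] * 9 + [10]
)
print(weighted, sorted(weighted_non_rare))
```

In the unweighted call, level `c` holds 10% of rows and is grouped; in the weighted call, `c` carries most of the weight and `b` falls below the cut-off instead.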
- cut_off_percent
Cut off percentage (either in terms of number of rows or sum of weight) for a given nominal level to be considered rare.
- Type:
float
- non_rare_levels
Created in fit. A dict of non-rare levels (i.e. levels with more than cut_off_percent weight or rows) that is used to identify rare levels in transform.
- Type:
dict
- rare_level_name
Must be of the same type as columns. Label for the new nominal level that will be added to group together rare levels (as defined by cut_off_percent).
- Type:
any
- record_rare_levels
Should the 'rare' levels that will be grouped together be recorded? If not, they will be lost after the fit and the only information remaining will be the 'non-rare' levels.
- Type:
bool
- rare_levels_record
Only created (in fit) if record_rare_levels is True. This is a dict containing a list of levels that were grouped into 'rare' for each column the transformer was applied to.
- Type:
dict
- weights_column
Name of the weights column to use if cut_off_percent should be in terms of sum of weight rather than number of rows.
- Type:
str
- unseen_levels_to_rare
If True, unseen levels in new data will be grouped into the rare level; if False, they will be left unchanged.
- Type:
bool
- training_data_levels
Dictionary containing the set of values present in the training data for each column in self.columns. It will only exist if unseen_levels_to_rare is set to False.
- Type:
dict[set]
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> GroupRareLevelsTransformer(
...     columns="a",
...     cut_off_percent=0.02,
...     rare_level_name="rare_level",
... )
GroupRareLevelsTransformer(columns=['a'], cut_off_percent=0.02,
                           rare_level_name='rare_level')
```
- FITS = True
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame | None = None) GroupRareLevelsTransformer[source]
Record non-rare levels for categorical variables.
When transform is called, only levels recorded in non_rare_levels during fit will remain unchanged; all other levels will be grouped. If record_rare_levels is True then the rare levels will also be recorded.
The label for the rare levels must be of the same type as the columns.
- Parameters:
X (DataFrame) – Data to identify non-rare levels from.
y (Series or LazyFrame or None, default = None) – Optional argument only required for the transformer to work with sklearn pipelines.
- Returns:
fitted class instance
- Return type:
GroupRareLevelsTransformer
Examples
```pycon
>>> import polars as pl
>>> transformer = GroupRareLevelsTransformer(
...     columns="a",
...     cut_off_percent=0.02,
...     rare_level_name="rare_level",
... )
>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": ["w", "z"]})
>>> transformer.fit(test_df)
GroupRareLevelsTransformer(columns=['a'], cut_off_percent=0.02,
                           rare_level_name='rare_level')
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> import tests.test_data as d
>>> df = d.create_df_8("pandas")
>>> x = GroupRareLevelsTransformer(
...     columns=["b", "c"], cut_off_percent=0.4, unseen_levels_to_rare=False
... )
>>> x.fit(df)
GroupRareLevelsTransformer(columns=['b', 'c'], cut_off_percent=0.4,
                           unseen_levels_to_rare=False)
>>> x.to_json()
{'tubular_version': ...,
 'classname': 'GroupRareLevelsTransformer',
 'init': {'columns': ['b', 'c'], 'copy': False, 'verbose': False,
          'return_native': True, 'cut_off_percent': 0.4,
          'weights_column': None, 'rare_level_name': 'rare',
          'record_rare_levels': True, 'unseen_levels_to_rare': False},
 'fit': {'is_fitted_': True,
         'non_rare_levels': {'b': ['w'], 'c': ['a']},
         'training_data_levels': {'b': ['w', 'x', 'y', 'z'], 'c': ['a', 'b', 'c']},
         'rare_levels_record': {'b': ['x', 'y', 'z'], 'c': ['b', 'c']}}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Group rare levels together into a new ‘rare’ level.
- Parameters:
X (DataFrame) – Data with categorical variables to apply rare level grouping to.
- Returns:
X – Transformed input X with rare levels grouped into a new rare level.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> transformer = GroupRareLevelsTransformer(
...     columns="a",
...     cut_off_percent=0.5,
...     rare_level_name="rare_level",
... )
>>> test_df = pl.DataFrame({"a": ["x", "x", "y"], "b": ["w", "z", "z"]})
>>> _ = transformer.fit(test_df)
>>> transformer.transform(test_df)
shape: (3, 2)
┌────────────┬─────┐
│ a          ┆ b   │
│ ---        ┆ --- │
│ str        ┆ str │
╞════════════╪═════╡
│ x          ┆ w   │
│ x          ┆ z   │
│ rare_level ┆ z   │
└────────────┴─────┘
```
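The fit/transform logic described for GroupRareLevelsTransformer can be sketched in plain Python. This is an illustrative approximation, not the library's code: the function names are hypothetical, null handling is omitted, and the inclusive comparison (`>=`) is an assumption based on the statement that levels below the cut-off are grouped.

```python
from collections import defaultdict


def fit_non_rare_levels(values, cut_off_percent, weights=None):
    """Return the levels whose share of rows (or of weight) reaches the cut-off."""
    if weights is None:
        weights = [1.0] * len(values)  # unweighted case: every row counts once
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    overall = sum(totals.values())
    # Assumption: only levels strictly below the cut-off are treated as rare.
    return sorted(
        level for level, total in totals.items()
        if total / overall >= cut_off_percent
    )


def group_rare_levels(values, non_rare_levels, rare_level_name="rare",
                      unseen_levels_to_rare=True, training_levels=None):
    """Replace every level outside `non_rare_levels` with the rare label."""
    keep = set(non_rare_levels)
    training = set(training_levels or [])
    out = []
    for value in values:
        if value in keep:
            out.append(value)
        elif not unseen_levels_to_rare and value not in training:
            out.append(value)  # unseen level is left unchanged
        else:
            out.append(rare_level_name)
    return out


train = ["x", "x", "y"]
non_rare = fit_non_rare_levels(train, cut_off_percent=0.5)
grouped = group_rare_levels(train, non_rare, rare_level_name="rare_level")
kept_unseen = group_rare_levels(
    ["x", "y", "q"], non_rare, rare_level_name="rare_level",
    unseen_levels_to_rare=False, training_levels=train,
)
```

With `unseen_levels_to_rare=False` the sketch needs the set of training levels, which is the role the documentation assigns to the `training_data_levels` attribute.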
- class tubular.LowerCaseTransformer(columns: str | ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]], **kwargs: bool | None)[source]
Bases:
BaseTransformer
Transformer class to lower case text columns.
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> from pprint import pprint
>>> transformer = LowerCaseTransformer(
...     columns=["a"],
... )
>>> transformer
LowerCaseTransformer(columns=['a'])
>>> json_dump = transformer.to_json()
>>> pprint(json_dump)
{'classname': 'LowerCaseTransformer',
 'fit': {'is_fitted_': False},
 'init': {'columns': ['a'],
          'copy': False,
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}
>>> LowerCaseTransformer.from_json(json_dump)
LowerCaseTransformer(columns=['a'])
```
- FITS = False
- get_transform_exprs() list[Expr][source]
Get transform expressions.
- Returns:
transform expressions for class
- Return type:
list[nw.Expr]
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Lower case of text in given columns.
- Parameters:
X (DataFrame) – Data containing columns to lowercase.
- Returns:
X – Transformed input X with text lowercased in given columns.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> test_df = pl.DataFrame({"a": ["HeLlO", None, " HI"]})
>>> transformer = LowerCaseTransformer(columns="a")
>>> transformer.transform(test_df)
shape: (3, 1)
┌───────┐
│ a     │
│ ---   │
│ str   │
╞═══════╡
│ hello │
│ null  │
│  hi   │
└───────┘
```
- class tubular.MappingTransformer(mappings: dict[str, dict[Any, Any]], return_dtypes: dict[str, RETURN_DTYPES] | None = None, **kwargs: bool | None)[source]
Bases:
BaseMappingTransformer, BaseMappingTransformMixin
Transformer to map values in columns to other values e.g. to merge two levels into one.
Note, the MappingTransformer does not require ‘self-mappings’ to be defined i.e. if you want to map a value to itself, you can omit this value from the mappings rather than having to map it to itself.
This transformer inherits from BaseMappingTransformMixin as well as BaseMappingTransformer. BaseMappingTransformer performs standard checks, while BaseMappingTransformMixin handles the actual logic.
- Parameters:
mappings (dict) – Dictionary containing column mappings. mappings should be a dictionary of key (column to apply mapping to) and value (mapping dict for that column) pairs. For example, the dict {'a': {1: 2, 3: 4}, 'b': {'a': 1, 'b': 2}} would specify a mapping for column a of 1->2, 3->4 and a mapping for column b of 'a'->1, 'b'->2.
return_dtypes (Optional[Dict[str, RETURN_DTYPES]]) – Dictionary of col:dtype for returned columns
**kwargs – Arbitrary keyword arguments passed on to the BaseMappingTransformer.__init__ method.
- mappings
Dictionary of mappings for each column individually. The dict passed to mappings in init is set to the mappings attribute.
- Type:
dict
- mappings_from_null
dict storing what null values will be mapped to. Generally best to use an imputer, but this functionality is useful for inverting pipelines.
- Type:
dict[str, Any]
- return_dtypes
Dictionary of col:dtype for returned columns
- Type:
dict[str, RETURN_DTYPES]
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> transformer = MappingTransformer(
...     mappings={"a": {"Y": 1, "N": 0}},
...     return_dtypes={"a": "Int8"},
... )
>>> transformer
MappingTransformer(mappings={'a': {'N': 0, 'Y': 1}},
                   return_dtypes={'a': 'Int8'})
>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = transformer.to_json()
>>> json_dump
{'tubular_version': ...,
 'classname': 'MappingTransformer',
 'init': {'copy': False, 'verbose': False, 'return_native': True,
          'mappings': {'a': {'Y': 1, 'N': 0}},
          'return_dtypes': {'a': 'Int8'}},
 'fit': {'is_fitted_': True}}
>>> MappingTransformer.from_json(json_dump)
MappingTransformer(mappings={'a': {'N': 0, 'Y': 1}},
                   return_dtypes={'a': 'Int8'})
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform the input data X according to the mappings in the mappings attribute dict.
This method calls the BaseMappingTransformMixin.transform. Note, this transform method is different to some of the transform methods in the nominal module, even though they also use the BaseMappingTransformMixin.transform method. Here, if a value does not exist in the mapping it is unchanged.
- Parameters:
X (DataFrame) – Data with nominal columns to transform.
- Returns:
X – Transformed input X with levels mapped according to mappings dict.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> transformer = MappingTransformer(
...     mappings={"a": {"Y": 1, "N": 0}},
...     return_dtypes={"a": "Int8"},
... )
>>> test_df = pl.DataFrame({"a": ["Y", "N"], "b": [3, 4]})
>>> transformer.transform(test_df)
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i8  ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 0   ┆ 4   │
└─────┴─────┘
```
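The rule that values missing from the mapping are left unchanged can be illustrated with a dictionary lookup that falls back to the original value. This is a minimal sketch under that single assumption; `apply_mapping` is a hypothetical helper, and the cast to `return_dtypes` performed by the real transformer is not modelled.

```python
def apply_mapping(values, mapping):
    """Map each value through `mapping`, leaving unmapped values as they are."""
    return [mapping.get(value, value) for value in values]


mapped = apply_mapping(["Y", "N", "maybe"], {"Y": 1, "N": 0})
```

Because the fallback is the value itself, 'self-mappings' such as `{"maybe": "maybe"}` never need to be written out, which matches the note in the class description.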
- class tubular.MeanImputer(columns: str | list[str], weights_column: str | None = None, **kwargs: bool)[source]
Bases:
WeightColumnMixin, BaseImputer
Transformer to impute missing values with the mean of the supplied columns.
- impute_values_
Created during fit method. Dictionary of float / int (mean) values of columns in the columns attribute. Keys of impute_values_ give the column names.
- Type:
dict
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> import polars as pl
>>> mean_imputer = MeanImputer(
...     columns=["a", "b"],
... )
>>> mean_imputer
MeanImputer(columns=['a', 'b'])
>>> # once fit, transformer can also be dumped to json and reinitialised
>>> test_df = pl.DataFrame({"a": [0, None], "b": [None, 1]})
>>> _ = mean_imputer.fit(test_df)
>>> json_dump = mean_imputer.to_json()
>>> json_dump
{'tubular_version': ...,
 'classname': 'MeanImputer',
 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False,
          'return_native': True, 'weights_column': None},
 'fit': {'is_fitted_': True, 'impute_values_': {'a': 0.0, 'b': 1.0}}}
>>> MeanImputer.from_json(json_dump)
MeanImputer(columns=['a', 'b'])
```
- FITS = True
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame | None = None) MeanImputer[source]
Calculate mean values to impute with from X.
- Parameters:
X (DataFrame) – Data to “learn” the mean values from.
y (Series or LazyFrame or None, default = None) – Not required.
- Returns:
fitted class instance.
- Return type:
MeanImputer
Examples
```pycon
>>> import polars as pl
>>> test_df = pl.DataFrame({"a": [1, None, 2], "b": [3, None, 4]})
>>> imputer = MeanImputer(columns=["a", "b"])
>>> imputer = imputer.fit(test_df)
>>> imputer.transform(test_df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 1.0 ┆ 3.0 │
│ 1.5 ┆ 3.5 │
│ 2.0 ┆ 4.0 │
└─────┴─────┘
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
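To show how a weights column might change the imputed value, here is a small pure-Python sketch of (weighted) mean imputation. It is not taken from tubular: `fit_mean` and `impute` are hypothetical helpers, and it assumes that rows with a null value are excluded from the mean and that the weighted mean is sum(w * x) / sum(w).

```python
def fit_mean(values, weights=None):
    """Mean of the non-null values, optionally weighted."""
    if weights is None:
        weights = [1.0] * len(values)
    # Assumption: rows whose value is null do not contribute to the mean.
    pairs = [(v, w) for v, w in zip(values, weights) if v is not None]
    total_weight = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total_weight


def impute(values, fill_value):
    """Replace nulls with the value learnt in fit."""
    return [fill_value if v is None else v for v in values]


unweighted = impute([1, None, 2], fit_mean([1, None, 2]))
weighted = impute([1, None, 3], fit_mean([1, None, 3], weights=[1, 5, 3]))
```

The unweighted call reproduces the 1.5 shown for column `a` in the fit example; a column containing only nulls would raise `ZeroDivisionError` in this sketch.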
- class tubular.MeanResponseTransformer(columns: str | list[str] | None = None, weights_column: str | None = None, prior: ]] = 0, level: float | int | str | list | None = None, unseen_level_handling: float | int | Literal['mean', 'median', 'min', 'max'] | None = None, return_type: Literal['Float32', 'Float64'] = 'Float32', drop_original: bool = True, **kwargs: bool)[source]
Bases:
BaseTransformer, WeightColumnMixin, DropOriginalMixin
Convert categorical variables to numeric by mapping levels to the mean response for level.
For a continuous or binary response the categorical columns specified will have values replaced with the mean response for each category.
For an n > 1 level categorical response, up to n binary responses can be created, which in turn can then be used to encode each categorical column specified. This will generate up to n * len(columns) new columns, with names of the form {column}_{response_level}. The original columns will be removed from the dataframe. This functionality is controlled using the 'level' parameter. Note that the above only works for an n > 1 level categorical response. Do not use the 'level' parameter for an n = 1 level numerical response; in that case, use the standard mean response transformer without the 'level' parameter.
If a categorical variable contains null values these will not be transformed.
The same weights and prior are applied to each response level in the multi-level case.
- columns
Categorical columns to encode in the input data.
- Type:
str or list
- weights_column
Weights column to use when calculating the mean response.
- Type:
str or None
- prior
Regularisation parameter, can be thought of roughly as the size a category should be in order for its statistics to be considered reliable (hence default value of 0 means no regularisation).
- Type:
int, default = 0
- level
Parameter to control encoding against a multi-level categorical response. If None the response will be treated as binary or continuous, if ‘all’ all response levels will be encoded against and if it is a list of levels then only the levels specified will be encoded against.
- Type:
str, int, float, list or None, default = None
- response_levels
Only created in the multi-level case. Generated from level, list of all the response levels to encode against.
- Type:
list
- mappings
Created in fit. A nested Dict of {column names : column specific mapping dictionary} pairs. Column specific mapping dictionaries contain {initial value : mapped value} pairs.
- Type:
dict
- mapped_columns
Only created in the multi-level case. A list of the new columns produced by encoding the columns in self.columns against multiple response levels, of the form {column}_{level}.
- Type:
list
- transformer_dict
Only created in the multi-level case. A dictionary of the form level : transformer containing the mean response transformers for each level to be encoded against.
- Type:
dict
- unseen_levels_encoding_dict
Dict containing the values (based on chosen unseen_level_handling) derived from the encoded columns to use when handling unseen levels in data passed to transform method.
- Type:
dict
- return_type
What type to cast return column as. Defaults to Float32.
- Type:
Literal['Float32', 'Float64']
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> import polars as pl
>>> transformer = MeanResponseTransformer(
...     columns="a",
...     prior=1,
...     unseen_level_handling="mean",
... )
>>> transformer
MeanResponseTransformer(columns=['a'], prior=1, unseen_level_handling='mean')
>>> # once fit, transformer can also be dumped to json and reinitialised
>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": [0, 1]})
>>> _ = transformer.fit(test_df[["a"]], test_df["b"])
>>> json_dump = transformer.to_json()
>>> json_dump
{'tubular_version': ...,
 'classname': 'MeanResponseTransformer',
 'init': {'columns': ['a'], 'copy': False, 'verbose': False,
          'return_native': True, 'weights_column': None, 'prior': 1,
          'level': None, 'unseen_level_handling': 'mean',
          'return_type': 'Float32', 'drop_original': True},
 'fit': {'is_fitted_': True,
         'mappings': {'a': {'x': 0.25, 'y': 0.75}},
         'return_dtypes': {'a': 'Float32'},
         'column_to_encoded_columns': {'a': ['a']},
         'encoded_columns': ['a'],
         'unseen_levels_encoding_dict': {'a': 0.5}}}
>>> MeanResponseTransformer.from_json(json_dump)
MeanResponseTransformer(columns=['a'], prior=1, unseen_level_handling='mean')
```
- FITS = True
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame) MeanResponseTransformer[source]
Identify mapping of categorical levels to mean response values.
If the user specified the weights_column arg when initialising the transformer, the weighted mean response will be calculated using that column.
In the multi-level case this method learns which response levels are present and are to be encoded against.
- Parameters:
X (DataFrame) – Data with categorical variable columns to transform and also containing response_column column.
y (Series or LazyFrame) – Response variable or target.
- Returns:
fitted class instance
- Return type:
MeanResponseTransformer
- Raises:
ValueError – if y contains null values.
Examples
```pycon
>>> import polars as pl
>>> transformer = MeanResponseTransformer(
...     columns="a",
...     prior=1,
...     unseen_level_handling="mean",
... )
>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": [1, 2], "target": [0, 1]})
>>> transformer.fit(test_df, test_df["target"])
MeanResponseTransformer(columns=['a'], prior=1, unseen_level_handling='mean')
```
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> import polars as pl
>>> transformer = MeanResponseTransformer(
...     columns="a",
...     prior=1,
...     unseen_level_handling="mean",
... )
>>> transformer.get_feature_names_out()
['a']
>>> transformer = MeanResponseTransformer(
...     columns="a",
...     prior=1,
...     level=["x", "y"],
...     unseen_level_handling="mean",
... )
>>> transformer.get_feature_names_out()
['a_x', 'a_y']
>>> transformer = MeanResponseTransformer(
...     columns="a",
...     prior=1,
...     level="all",
...     unseen_level_handling="mean",
... )
>>> transformer.get_feature_names_out()
Traceback (most recent call last):
...
sklearn.exceptions.NotFittedError: ...
>>> test_df = pl.DataFrame({"a": ["x", "y", "x"], "b": ["cat", "dog", "rat"]})
>>> _ = transformer.fit(test_df, test_df["b"])
>>> transformer.get_feature_names_out()
['a_cat', 'a_dog', 'a_rat']
```
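The multi-level case can be pictured as a one-vs-rest split of the response followed by one encoded column per (column, level) pair. The sketch below only illustrates that naming and splitting idea: both helpers are hypothetical, and sorting the levels is an assumption made to obtain a stable order.

```python
def binary_responses(response, levels):
    """Build one 0/1 response per level (1 where the response equals the level)."""
    return {
        level: [1 if r == level else 0 for r in response]
        for level in levels
    }


def encoded_column_names(columns, levels):
    """Names of the form {column}_{level}, one per column and response level."""
    return [f"{column}_{level}" for column in columns for level in levels]


response = ["cat", "dog", "rat"]
levels = sorted(set(response))  # assumption: levels are taken in sorted order
targets = binary_responses(response, levels)
names = encoded_column_names(["a"], levels)
```

Each binary response in `targets` could then be encoded against in the same way as a single binary response, giving up to n * len(columns) output columns.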
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> import polars as pl
>>> transformer = MeanResponseTransformer(columns=["a"])
>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": [0, 1]})
>>> _ = transformer.fit(test_df[["a"]], test_df["b"])
>>> transformer.to_json()
{'tubular_version': ...,
 'classname': 'MeanResponseTransformer',
 'init': {'columns': ['a'], 'copy': False, 'verbose': False,
          'return_native': True, 'weights_column': None, 'prior': 0,
          'level': None, 'unseen_level_handling': None,
          'return_type': 'Float32', 'drop_original': True},
 'fit': {'is_fitted_': True,
         'mappings': {'a': {'x': 0.0, 'y': 1.0}},
         'return_dtypes': {'a': 'Float32'},
         'column_to_encoded_columns': {'a': ['a']},
         'encoded_columns': ['a']}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Apply mean response encoding stored in the mappings attribute to columns.
- Parameters:
X (DataFrame) – Data with nominal columns to transform.
- Returns:
X – Transformed input X with levels mapped according to mappings dict.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> # example with no prior
>>> transformer = MeanResponseTransformer(
...     columns="a",
...     prior=0,
...     unseen_level_handling="mean",
... )
>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": [1, 2], "target": [0, 1]})
>>> _ = transformer.fit(test_df, test_df["target"])
>>> transformer.transform(test_df)
shape: (2, 3)
┌─────┬─────┬────────┐
│ a   ┆ b   ┆ target │
│ --- ┆ --- ┆ ---    │
│ f32 ┆ i64 ┆ i64    │
╞═════╪═════╪════════╡
│ 0.0 ┆ 1   ┆ 0      │
│ 1.0 ┆ 2   ┆ 1      │
└─────┴─────┴────────┘
>>> # example with prior
>>> transformer = MeanResponseTransformer(
...     columns="a",
...     prior=1,
...     unseen_level_handling="mean",
... )
>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": [1, 2], "target": [0, 1]})
>>> _ = transformer.fit(test_df, test_df["target"])
>>> transformer.transform(test_df)
shape: (2, 3)
┌──────┬─────┬────────┐
│ a    ┆ b   ┆ target │
│ ---  ┆ --- ┆ ---    │
│ f32  ┆ i64 ┆ i64    │
╞══════╪═════╪════════╡
│ 0.25 ┆ 1   ┆ 0      │
│ 0.75 ┆ 2   ┆ 1      │
└──────┴─────┴────────┘
```
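The effect of `prior` in the examples above is consistent with shrinking each level's mean towards the overall mean. The sketch below uses the formula (sum of response for the level + prior * global mean) / (weight of the level + prior). That formula is an assumption that happens to reproduce the documented 0.25 / 0.75 values, not a statement of tubular's exact implementation, and the helper names are hypothetical.

```python
from collections import defaultdict


def fit_mean_response(categories, response, prior=0, weights=None):
    """Map each level to a prior-regularised (weighted) mean of the response."""
    if weights is None:
        weights = [1.0] * len(categories)
    global_mean = sum(r * w for r, w in zip(response, weights)) / sum(weights)
    sums = defaultdict(float)
    totals = defaultdict(float)
    for level, r, w in zip(categories, response, weights):
        sums[level] += r * w
        totals[level] += w
    # Assumed formula: shrink each level mean towards the global mean.
    return {
        level: (sums[level] + prior * global_mean) / (totals[level] + prior)
        for level in sums
    }


def unseen_level_value(categories, mapping):
    """Assumed 'mean' handling: average of the encoded training column."""
    encoded = [mapping[level] for level in categories]
    return sum(encoded) / len(encoded)


cats, target = ["x", "y"], [0, 1]
no_prior = fit_mean_response(cats, target, prior=0)
with_prior = fit_mean_response(cats, target, prior=1)
unseen = unseen_level_value(cats, with_prior)
```

With `prior=0` the mapping is the plain per-level mean; larger priors pull small categories closer to the global mean of 0.5.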
- class tubular.MedianImputer(columns: str | list[str], weights_column: str | None = None, **kwargs: bool)[source]
Bases:
BaseImputer, WeightColumnMixin
Transformer to impute missing values with the median of the supplied columns.
- impute_values_
Created during fit method. Dictionary of float / int (median) values of columns in the columns attribute. Keys of impute_values_ give the column names.
- Type:
dict
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> import polars as pl
>>> median_imputer = MedianImputer(
...     columns=["a", "b"],
... )
>>> median_imputer
MedianImputer(columns=['a', 'b'])
>>> # once fit, transformer can also be dumped to json and reinitialised
>>> test_df = pl.DataFrame({"a": [0, None], "b": [None, 1]})
>>> _ = median_imputer.fit(test_df)
>>> json_dump = median_imputer.to_json()
>>> json_dump
{'tubular_version': ...,
 'classname': 'MedianImputer',
 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False,
          'return_native': True, 'weights_column': None},
 'fit': {'is_fitted_': True, 'impute_values_': {'a': 0.0, 'b': 1.0}}}
>>> MedianImputer.from_json(json_dump)
MedianImputer(columns=['a', 'b'])
```
- FITS = True
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame | None = None) MedianImputer[source]
Calculate median values to impute with from X.
- Parameters:
X (DataFrame) – Data to “learn” the median values from.
y (Series or LazyFrame or None, default = None) – Not required.
- Returns:
fitted class instance.
- Return type:
MedianImputer
Examples
```pycon
>>> import polars as pl
>>> test_df = pl.DataFrame({"a": [1, None, 2], "b": [3, None, 4]})
>>> imputer = MedianImputer(columns=["a", "b"])
>>> imputer = imputer.fit(test_df)
>>> imputer.transform(test_df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 1.0 ┆ 3.0 │
│ 1.5 ┆ 3.5 │
│ 2.0 ┆ 4.0 │
└─────┴─────┘
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- class tubular.ModeImputer(columns: str | list[str], weights_column: str | None = None, **kwargs: bool)[source]
Bases:
BaseImputer, WeightColumnMixin
Transformer to impute missing values with the mode of the supplied columns.
If mode is NaN, a warning will be raised.
- impute_values_
Created during fit method. Dictionary of float / int (mode) values of columns in the columns attribute. Keys of impute_values_ give the column names.
- Type:
dict
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> import polars as pl
>>> mode_imputer = ModeImputer(
...     columns=["a", "b"],
... )
>>> mode_imputer
ModeImputer(columns=['a', 'b'])
>>> # once fit, transformer can also be dumped to json and reinitialised
>>> test_df = pl.DataFrame({"a": [0, None], "b": [None, 1]})
>>> _ = mode_imputer.fit(test_df)
>>> json_dump = mode_imputer.to_json()
>>> json_dump
{'tubular_version': ...,
 'classname': 'ModeImputer',
 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False,
          'return_native': True, 'weights_column': None},
 'fit': {'is_fitted_': True, 'impute_values_': {'a': 0, 'b': 1}}}
>>> ModeImputer.from_json(json_dump)
ModeImputer(columns=['a', 'b'])
```
- FITS = True
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame | None = None) ModeImputer[source]
Calculate mode values to impute with from X.
In the event of a tie, the highest modal value will be returned.
- Parameters:
X (DataFrame) – Data to “learn” the mode values from.
y (Series or LazyFrame or None, default = None) – Not required.
- Returns:
fitted class instance
- Return type:
ModeImputer
Examples
```pycon
>>> import polars as pl
>>> test_df = pl.DataFrame({"a": [1, None, 2], "b": [3, None, 4]})
>>> imputer = ModeImputer(columns=["a", "b"])
>>> imputer = imputer.fit(test_df)
>>> imputer.transform(test_df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
│ 2   ┆ 4   │
└─────┴─────┘
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
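The tie-breaking rule stated in fit ("the highest modal value will be returned") can be illustrated with a frequency count followed by a maximum over the tied levels. This is a sketch only: `fit_mode` and `impute` are hypothetical helpers, weights are ignored, and the all-null case, which the documentation says produces a warning, is not handled here.

```python
from collections import Counter


def fit_mode(values):
    """Most frequent non-null value; ties are broken by taking the highest."""
    counts = Counter(v for v in values if v is not None)
    top = max(counts.values())
    return max(level for level, count in counts.items() if count == top)


def impute(values, fill_value):
    """Replace nulls with the value learnt in fit."""
    return [fill_value if v is None else v for v in values]


column = [1, None, 2]
mode = fit_mode(column)
filled = impute(column, mode)
```

In the example both 1 and 2 occur once, so the higher value 2 is used, which agrees with the documented transform output.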
- class tubular.NullIndicator(columns: ]] | str, **kwargs: bool | None)[source]
Bases:
BaseTransformer
Class to create a binary indicator column for null values.
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> null_indicator = NullIndicator(
...     columns=["a", "b"],
... )
>>> null_indicator
NullIndicator(columns=['a', 'b'])
>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = null_indicator.to_json()
>>> json_dump
{'tubular_version': ...,
 'classname': 'NullIndicator',
 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False,
          'return_native': True},
 'fit': {'is_fitted_': True}}
>>> NullIndicator.from_json(json_dump)
NullIndicator(columns=['a', 'b'])
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Create new columns indicating the position of null values for each variable in self.columns.
- Parameters:
X (DataFrame) – Data to add indicators to.
- Returns:
dataframe with null indicator columns added
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> test_df = pl.DataFrame({"a": [1, None, 2], "b": [3, None, 4]})
>>> imputer = NullIndicator(columns=["a", "b"])
>>> imputer.transform(test_df)
shape: (3, 4)
┌──────┬──────┬─────────┬─────────┐
│ a    ┆ b    ┆ a_nulls ┆ b_nulls │
│ ---  ┆ ---  ┆ ---     ┆ ---     │
│ i64  ┆ i64  ┆ bool    ┆ bool    │
╞══════╪══════╪═════════╪═════════╡
│ 1    ┆ 3    ┆ false   ┆ false   │
│ null ┆ null ┆ true    ┆ true    │
│ 2    ┆ 4    ┆ false   ┆ false   │
└──────┴──────┴─────────┴─────────┘
```
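The indicator columns shown above can be mimicked in plain Python by testing each value for null and storing the result under a `{column}_nulls` name. This is a hedged sketch over a dict-of-lists table; `add_null_indicators` is a hypothetical helper and `None` stands in for a dataframe null.

```python
def add_null_indicators(table, columns):
    """Return a copy of `table` with a boolean '<column>_nulls' column per input."""
    result = {name: list(values) for name, values in table.items()}
    for column in columns:
        # True where the original value is null, False otherwise.
        result[f"{column}_nulls"] = [value is None for value in table[column]]
    return result


table = {"a": [1, None, 2], "b": [3, None, 4]}
flagged = add_null_indicators(table, ["a", "b"])
```

The original columns are kept as they are, so the indicators can be combined with a later imputation step without losing track of which rows were filled.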
- class tubular.OneDKmeansTransformer(columns: str | ~typing.Annotated[list[str], beartype.vale.Is[lambda list_arg: ...]], new_column_name: str, n_init: str | int = 'auto', n_clusters: int = 8, drop_original: bool = False, kmeans_kwargs: dict[str, object] | None = None, **kwargs: bool)[source]
Bases:
BaseNumericTransformer, DropOriginalMixin
Generates a new column based on the kmeans algorithm.
The transformer runs the kmeans algorithm with the given number of clusters and then identifies the bins' cuts based on the results. Finally, it passes them into a cut function.
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> OneDKmeansTransformer(
...     columns="a",
...     n_clusters=2,
...     new_column_name="new",
...     drop_original=False,
...     kmeans_kwargs={"random_state": 42},
... )
OneDKmeansTransformer(columns=['a'], kmeans_kwargs={'random_state': 42},
                      n_clusters=2, new_column_name='new')
```
- FITS = True
- fit(X: FrameT, y: IntoSeriesT | None = None) OneDKmeansTransformer[source]
Fit transformer to input data.
- Parameters:
X (pd/pl.DataFrame) – Dataframe with columns to learn scaling values from.
y (None) – Required for pipeline.
- Returns:
Fitted class instance.
- Return type:
OneDKmeansTransformer
- Raises:
ValueError – if columns in X contain missing values.
Examples
```pycon
>>> import polars as pl

>>> transformer = OneDKmeansTransformer(
...     columns="a",
...     n_clusters=2,
...     new_column_name="new",
...     drop_original=False,
...     kmeans_kwargs={"random_state": 42},
... )

>>> test_df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

>>> transformer.fit(test_df)
OneDKmeansTransformer(columns=['a'], kmeans_kwargs={'random_state': 42},
                      n_clusters=2, new_column_name='new')
```
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> transformer = OneDKmeansTransformer(
...     columns="a",
...     n_clusters=2,
...     new_column_name="kmeans_column",
...     drop_original=False,
...     kmeans_kwargs={"random_state": 42},
... )

>>> transformer.get_feature_names_out()
['kmeans_column']
```
- jsonable = True
- lazyframe_compatible = False
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Serialize the transformer to a JSON-compatible dictionary.
- Returns:
JSON representation of the transformer, including init parameters.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> import polars as pl
>>> x = OneDKmeansTransformer(
...     columns="a",
...     n_clusters=2,
...     new_column_name="new",
...     drop_original=False,
...     kmeans_kwargs={"random_state": 42},
... )
>>> test_df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
>>> x.fit(test_df)
OneDKmeansTransformer(columns=['a'], kmeans_kwargs={'random_state': 42},
                      n_clusters=2, new_column_name='new')
>>> x.to_json()
{'tubular_version': ..., 'classname': 'OneDKmeansTransformer', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'new_column_name': 'new', 'n_init': 'auto', 'n_clusters': 2, 'drop_original': False, 'kmeans_kwargs': {'random_state': 42}}, 'fit': {'is_fitted_': True, 'bins': [3, 4]}}
```
- transform(X: FrameT) FrameT[source]
Bin the input column of X (pd/pl.DataFrame) using the cuts learned from the k-means results, and add the binned values to X as the new column.
- Parameters:
X (pl/pd.DataFrame) – Data to transform.
- Returns:
X – Input X with additional cluster column added.
- Return type:
pl/pd.DataFrame
Examples
```pycon
>>> import polars as pl

>>> transformer = OneDKmeansTransformer(
...     columns="a",
...     n_clusters=2,
...     new_column_name="new",
...     drop_original=False,
...     kmeans_kwargs={"random_state": 42},
... )

>>> test_df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

>>> _ = transformer.fit(test_df)
>>> transformer.transform(test_df)
shape: (4, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ new │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 5   ┆ 0   │
│ 2   ┆ 6   ┆ 0   │
│ 3   ┆ 7   ┆ 0   │
│ 4   ┆ 8   ┆ 1   │
└─────┴─────┴─────┘
```
- class tubular.OneHotEncodingTransformer(columns: str | ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]] | None = None, wanted_values: dict[str, ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]]] | None = None, separator: str = '_', drop_original: bool = False, **kwargs: bool)[source]
Bases:
DropOriginalMixin, BaseTransformer
Transformer to convert categorical variables into dummy columns.
- separator
Separator used in naming for dummy columns.
- Type:
str
- drop_original
Should original columns be dropped after creating dummy fields?
- Type:
bool
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> import polars as pl

>>> transformer = OneHotEncodingTransformer(
...     columns="a",
... )
>>> transformer
OneHotEncodingTransformer(columns=['a'])

>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": ["w", "z"]})

>>> _ = transformer.fit(test_df)

>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = transformer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'OneHotEncodingTransformer', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'wanted_values': None, 'separator': '_', 'drop_original': False}, 'fit': {'is_fitted_': True, 'categories_': {'a': ['x', 'y']}, 'new_feature_names_': {'a': ['a_x', 'a_y']}}}

>>> OneHotEncodingTransformer.from_json(json_dump)
OneHotEncodingTransformer(columns=['a'])
```
- FITS = True
- MAX_LEVELS = 100
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame | None = None) OneHotEncodingTransformer[source]
Get list of levels for each column to be transformed.
This defines which dummy columns will be created in transform.
- Parameters:
X (DataFrame) – Data to identify levels from.
y (None) – Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns:
fitted class instance
- Return type:
OneHotEncodingTransformer
- Raises:
ValueError – if a column has more than 100 levels.
Examples
```pycon
>>> import polars as pl

>>> transformer = OneHotEncodingTransformer(
...     columns="a",
... )

>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": [1, 2]})

>>> transformer.fit(test_df)
OneHotEncodingTransformer(columns=['a'])
```
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> import polars as pl

>>> transformer = OneHotEncodingTransformer(
...     columns="a",
...     wanted_values={"a": ["cat", "dog"]},
... )

>>> transformer.get_feature_names_out()
['a_cat', 'a_dog']

>>> transformer = OneHotEncodingTransformer(
...     columns="a",
... )

>>> transformer.get_feature_names_out()
Traceback (most recent call last):
...
sklearn.exceptions.NotFittedError: ...

>>> test_df = pl.DataFrame({"a": ["cat", "dog", "rat"]})

>>> _ = transformer.fit(test_df)

>>> transformer.get_feature_names_out()
['a_cat', 'a_dog', 'a_rat']
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> import polars as pl

>>> transformer = OneHotEncodingTransformer(columns=["a"])

>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": ["w", "z"]})

>>> _ = transformer.fit(test_df)

>>> # version will vary for local vs CI, so use ... as generic match
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'OneHotEncodingTransformer', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'wanted_values': None, 'separator': '_', 'drop_original': False}, 'fit': {'is_fitted_': True, 'categories_': {'a': ['x', 'y']}, 'new_feature_names_': {'a': ['a_x', 'a_y']}}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, return_native_override: bool | None = None) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Create new dummy columns from categorical fields.
- Parameters:
X (DataFrame) – Data to apply one hot encoding to.
return_native_override (Optional[bool]) – Option to override the return_native attribute of the transformer, useful when calling parent methods. Controls whether the transformer returns a narwhals or native type.
- Returns:
X_transformed – Transformed input X with dummy columns derived from categorical columns added. If drop_original = True then the original categorical columns that the dummies are created from will not be in the output X.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl

>>> transformer = OneHotEncodingTransformer(
...     columns="a",
... )

>>> test_df = pl.DataFrame({"a": ["x", "y"], "b": [1, 2]})

>>> _ = transformer.fit(test_df)

>>> transformer.transform(test_df)
shape: (2, 4)
┌─────┬─────┬───────┬───────┐
│ a   ┆ b   ┆ a_x   ┆ a_y   │
│ --- ┆ --- ┆ ---   ┆ ---   │
│ str ┆ i64 ┆ bool  ┆ bool  │
╞═════╪═════╪═══════╪═══════╡
│ x   ┆ 1   ┆ true  ┆ false │
│ y   ┆ 2   ┆ false ┆ true  │
└─────┴─────┴───────┴───────┘
```
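The fit/transform split for one hot encoding can be sketched on a plain dict of lists. This is only an illustration of the idea, not tubular's code: the helpers `fit_levels` and `one_hot` are hypothetical, the sorted ordering of levels and the treatment of nulls are assumptions, and the `max_levels` default mirrors the MAX_LEVELS = 100 class attribute.

```python
def fit_levels(data, columns, max_levels=100):
    """Learn the distinct non-null levels of each column (sorted, as an assumption)."""
    levels = {}
    for col in columns:
        found = sorted({value for value in data[col] if value is not None})
        if len(found) > max_levels:
            raise ValueError(f"column {col!r} has more than {max_levels} levels")
        levels[col] = found
    return levels


def one_hot(data, levels, separator="_", drop_original=False):
    """Add one boolean dummy column per learned level, named <column><separator><level>."""
    out = {
        name: list(values)
        for name, values in data.items()
        if not (drop_original and name in levels)
    }
    for col, col_levels in levels.items():
        for level in col_levels:
            out[f"{col}{separator}{level}"] = [value == level for value in data[col]]
    return out


frame = {"a": ["x", "y"], "b": [1, 2]}
levels = fit_levels(frame, columns=["a"])
encoded = one_hot(frame, levels)
print(list(encoded))
```

Learning the levels once in fit means transform always emits the same dummy columns, even when a later batch is missing some levels.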
- class tubular.OutOfRangeNullTransformer(capping_values: dict[str, ~typing.Annotated[list[int | float | None], beartype.vale.Is[lambda list_arg: ...]]] | None = None, quantiles: dict[str, ~typing.Annotated[list[int | float | None], beartype.vale.Is[lambda list_arg: ...]]] | None = None, weights_column: str | None = None, **kwargs: bool)[source]
Bases:
BaseCappingTransformer
Transformer to set values outside of a range to null.
This transformer sets the cut off values in the same way as the CappingTransformer. So either the user can specify them directly in the capping_values argument or they can be calculated in the fit method, if the user supplies the quantiles argument.
Attributes:
- capping_values: dict[str, CappingValues] or None
Capping values to apply to each column, capping_values argument.
- quantiles: dict[str, CappingValues] or None
Quantiles to set capping values at from input data. Will be empty after init, values populated when fit is run.
- quantile_capping_values: dict[str, CappingValues] or None
Capping values learned from quantiles (if provided) to apply to each column.
- weights_column: str or None
weights_column argument.
- _replacement_values: dict[str, CappingValues]
Replacement values when capping is applied. This will contain nulls for each column.
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> import polars as pl

>>> transformer = OutOfRangeNullTransformer(
...     capping_values={"a": [10, 20], "b": [1, 3]},
... )
>>> transformer
OutOfRangeNullTransformer(capping_values={'a': [10, 20], 'b': [1, 3]})

>>> # transform method is inherited so also demo that here
>>> test_df = pl.DataFrame({"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]})

>>> transformer.transform(test_df)
shape: (4, 3)
┌──────┬──────┬─────┐
│ a    ┆ b    ┆ c   │
│ ---  ┆ ---  ┆ --- │
│ i64  ┆ i64  ┆ i64 │
╞══════╪══════╪═════╡
│ null ┆ null ┆ 1   │
│ 15   ┆ 2    ┆ 2   │
│ 18   ┆ null ┆ 3   │
│ null ┆ 1    ┆ 4   │
└──────┴──────┴─────┘

>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = transformer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'OutOfRangeNullTransformer', 'init': {'copy': False, 'verbose': False, 'return_native': True, 'capping_values': {'a': [10, 20], 'b': [1, 3]}, 'quantiles': None, 'weights_column': None}, 'fit': {'is_fitted_': False}}

>>> OutOfRangeNullTransformer.from_json(json_dump)
OutOfRangeNullTransformer(capping_values={'a': [10, 20], 'b': [1, 3]})
```
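The effect of the transformer on a single column can be sketched in plain Python. This is a rough illustration rather than the library's implementation; the helper `null_out_of_range` is hypothetical, and it assumes the bounds are inclusive (which matches the example output, where `b = 1` survives a lower bound of 1) and that a None bound means no limit on that side.

```python
def null_out_of_range(values, lower=None, upper=None):
    """Replace values strictly below `lower` or strictly above `upper` with None."""
    result = []
    for value in values:
        below = lower is not None and value is not None and value < lower
        above = upper is not None and value is not None and value > upper
        result.append(None if below or above else value)
    return result


capping_values = {"a": [10, 20], "b": [1, 3]}
frame = {"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]}

# Columns without capping values pass through unchanged.
nulled = {
    col: null_out_of_range(vals, *capping_values[col]) if col in capping_values else list(vals)
    for col, vals in frame.items()
}
print(nulled)
```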
- FITS = True
- fit(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame, y: Series | Series | Series | LazyFrame | LazyFrame | None = None) OutOfRangeNullTransformer[source]
Learn capping values from input data X.
Calculates the quantiles to cap at given the quantiles dictionary supplied when initialising the transformer. Saves learnt values in the capping_values attribute.
- Parameters:
X (DataFrame) – A dataframe with required columns to be capped.
y (None) – Required for pipeline.
- Returns:
fitted instance of class
- Return type:
OutOfRangeNullTransformer
Example
```pycon
>>> import polars as pl

>>> transformer = OutOfRangeNullTransformer(
...     quantiles={"a": [0.01, 0.99], "b": [0.05, 0.95]},
... )

>>> test_df = pl.DataFrame({"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]})

>>> transformer.fit(test_df)
OutOfRangeNullTransformer(quantiles={'a': [0.01, 0.99], 'b': [0.05, 0.95]})
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- static set_replacement_values(capping_values: dict[str, list[int | float | None]]) dict[str, list[bool | None]][source]
Set the _replacement_values to have all null values.
Keeps the keys of capping_values and replaces every non-None value in the lists with None (null). As the example below shows, an entry that is already None in capping_values is mapped to False.
- Returns:
replacement_values – replacement values for OutOfRangeNullTransformer
- Return type:
dict[str, list[bool | None]]
Examples
```pycon
>>> import polars as pl

>>> capping_values = {"a": [0.1, 0.2], "b": [None, 10]}

>>> OutOfRangeNullTransformer.set_replacement_values(capping_values)
{'a': [None, None], 'b': [False, None]}
```
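The mapping shown in the example above can be reproduced with a short dictionary comprehension. This sketch is inferred purely from the documented input and output, not from the library source; the function name `replacement_values_sketch` is hypothetical.

```python
def replacement_values_sketch(capping_values):
    """Map each bound to its replacement: a set bound becomes None, a missing bound becomes False."""
    return {
        column: [False if bound is None else None for bound in bounds]
        for column, bounds in capping_values.items()
    }


capping_values = {"a": [0.1, 0.2], "b": [None, 10]}
replacements = replacement_values_sketch(capping_values)
print(replacements)
```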
- class tubular.RatioTransformer(columns: ]], return_dtype: ]] = 'Float32', **kwargs: bool | None)[source]
Bases:
BaseNumericTransformer
Transformer that performs a division operation between two columns.
This transformer allows performing division between two columns in a DataFrame and stores the result in a new column.
- columns
List of exactly two column names to operate on. The first column is the numerator, and the second column is the denominator.
- Type:
ListOfTwoStrs
- return_dtype
The dtype of the resulting column, either ‘Float32’ or ‘Float64’.
- Type:
str
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> transformer = RatioTransformer(columns=["a", "b"], return_dtype="Float32")
>>> transformer.columns
['a', 'b']
>>> transformer.return_dtype
'Float32'
```
- FITS = False
- get_feature_names_out() list[str][source]
Get the names of the output features.
- Returns:
List containing the name of the new column created by the transformation.
- Return type:
list[str]
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Serialize the transformer to a JSON-compatible dictionary.
- Returns:
JSON representation of the transformer, including init parameters.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> ratio_transformer = RatioTransformer(columns=["a", "b"], return_dtype="Float32")
>>> ratio_transformer.to_json()
{'tubular_version': ..., 'classname': 'RatioTransformer', 'init': {'columns': ['a', 'b'], 'copy': False, 'verbose': False, 'return_native': True, 'return_dtype': 'Float32'}, 'fit': {'is_fitted_': True}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Transform the DataFrame by applying the division operation between two columns.
- Parameters:
X (DataFrame) – DataFrame containing the columns to operate on.
- Returns:
Transformed DataFrame with the new column containing the division results.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> transformer = RatioTransformer(columns=["a", "b"], return_dtype="Float32")
>>> test_df = pl.DataFrame({"a": [100, 200, 300], "b": [80, 150, 200]})
>>> transformer.transform(test_df)
shape: (3, 3)
┌─────┬─────┬────────────────┐
│ a   ┆ b   ┆ a_divided_by_b │
│ --- ┆ --- ┆ ---            │
│ i64 ┆ i64 ┆ f32            │
╞═════╪═════╪════════════════╡
│ 100 ┆ 80  ┆ 1.25           │
│ 200 ┆ 150 ┆ 1.333333       │
│ 300 ┆ 200 ┆ 1.5            │
└─────┴─────┴────────────────┘
```
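The ratio calculation itself is a row-wise division, sketched below on a dict of lists. The helper `add_ratio` is hypothetical, and the output column name follows the `a_divided_by_b` pattern seen in the example. Returning None for a null operand or a zero denominator is an assumption made for this sketch; the documentation above does not state how the transformer handles those cases.

```python
def add_ratio(data, numerator, denominator):
    """Add a `<numerator>_divided_by_<denominator>` column holding the row-wise ratio."""
    out = {name: list(values) for name, values in data.items()}
    out[f"{numerator}_divided_by_{denominator}"] = [
        # Assumption: null operands and zero denominators yield None.
        None if n is None or d is None or d == 0 else n / d
        for n, d in zip(data[numerator], data[denominator])
    ]
    return out


frame = {"a": [100, 200, 300], "b": [80, 150, 200]}
ratios = add_ratio(frame, "a", "b")
print(ratios["a_divided_by_b"])
```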
- class tubular.RemoveCharactersTransformer(columns: str | ~typing.Annotated[list, beartype.vale.Is[lambda list_arg: ...]], characters: list[str], **kwargs: bool | None)[source]
Bases:
BaseTransformer
Transformer class to remove characters from text columns.
- characters
list of characters to remove from text columns.
- Type:
list[str]
- characters_formatted
characters attr formatted into regex string.
- Type:
str
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- return_native
Controls whether transformer returns narwhals or native pandas/polars type
- Type:
bool, default = True
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> from pprint import pprint
>>> transformer = RemoveCharactersTransformer(columns=["a"], characters=["\\d"])
>>> transformer
RemoveCharactersTransformer(characters=['\\d'], columns=['a'])

>>> json_dump = transformer.to_json()
>>> pprint(json_dump)
{'classname': 'RemoveCharactersTransformer',
 'fit': {'is_fitted_': False},
 'init': {'characters': ['\\d'],
          'columns': ['a'],
          'copy': False,
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}

>>> RemoveCharactersTransformer.from_json(json_dump)
RemoveCharactersTransformer(characters=['\\d'], columns=['a'])
```
- FITS = False
- get_transform_exprs() list[Expr][source]
Get transform expressions.
- Returns:
transform expressions for class
- Return type:
list[nw.Expr]
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> from pprint import pprint
>>> transformer = RemoveCharactersTransformer(columns=["a", "b"], characters=["a"])

>>> pprint(transformer.to_json())
{'classname': 'RemoveCharactersTransformer',
 'fit': {'is_fitted_': False},
 'init': {'characters': ['a'],
          'columns': ['a', 'b'],
          'copy': False,
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Strip unwanted characters from specified columns.
- Parameters:
X (DataFrame) – Data containing columns to strip.
- Returns:
X – Transformed input X with characters stripped from specified columns.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl
>>> test_df = pl.DataFrame({"a": [" 8hi!", None, "9999hello "]})
>>> transformer = RemoveCharactersTransformer(columns=["a"], characters=["\\W", "\\s"])
>>> transformer.transform(test_df)
shape: (3, 1)
┌───────────┐
│ a         │
│ ---       │
│ str       │
╞═══════════╡
│ 8hi       │
│ null      │
│ 9999hello │
└───────────┘
```
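The character stripping can be sketched with the standard `re` module. This is a minimal illustration, assuming the entries of `characters` are regex fragments that get combined into a single alternation; how tubular actually builds its characters_formatted string is not shown in this documentation, and the helper `remove_characters` is hypothetical.

```python
import re


def remove_characters(values, characters):
    """Delete every match of any pattern in `characters` from each string; keep nulls as None."""
    # Assumption: the patterns are joined with "|" into one regular expression.
    pattern = re.compile("|".join(characters))
    return [None if value is None else pattern.sub("", value) for value in values]


cleaned = remove_characters([" 8hi!", None, "9999hello "], [r"\W", r"\s"])
print(cleaned)
```

Raw strings (`r"\W"`) avoid the double-escaping otherwise needed for regex classes in ordinary string literals.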
- class tubular.RenameColumnsTransformer(columns: ]] | str, new_column_names: dict[str, str], drop_original: bool = True, **kwargs: bool)[source]
Bases:
BaseTransformer, DropOriginalMixin
Transformer to rename a given set of columns.
This can be useful for personalising the auto-output names from other transformers, or for creating a few different versions of a given column to undergo separate paths of logic in a pipeline (as the expression logic effectively creates duplicates of the column).
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> from pprint import pprint
>>> transformer = RenameColumnsTransformer(
...     columns="a", new_column_names={"a": "new_a"}
... )  # noqa: E501
>>> transformer
RenameColumnsTransformer(columns=['a'], new_column_names={'a': 'new_a'})

>>> # transformer can also be dumped to json and reinitialised
>>> json_dump = transformer.to_json()
>>> pprint(json_dump, sort_dicts=True)
{'classname': 'RenameColumnsTransformer',
 'fit': {'is_fitted_': True},
 'init': {'columns': ['a'],
          'copy': False,
          'drop_original': True,
          'new_column_names': {'a': 'new_a'},
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}

>>> RenameColumnsTransformer.from_json(json_dump)
RenameColumnsTransformer(columns=['a'], new_column_names={'a': 'new_a'})
```
- FITS = False
- get_feature_names_out() list[str][source]
List features modified/created by the transformer.
- Returns:
list of features modified/created by the transformer
- Return type:
list[str]
Examples
```pycon
>>> transformer = RenameColumnsTransformer(
...     columns=["a", "b"],
...     new_column_names={"a": "new_a", "b": "new_b"},
... )

>>> transformer.get_feature_names_out()
['new_a', 'new_b']
```
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> from pprint import pprint
>>> transformer = RenameColumnsTransformer(
...     columns="a", new_column_names={"a": "new_a"}
... )  # noqa: E501
>>> pprint(transformer.to_json(), sort_dicts=True)
{'classname': 'RenameColumnsTransformer',
 'fit': {'is_fitted_': True},
 'init': {'columns': ['a'],
          'copy': False,
          'drop_original': True,
          'new_column_names': {'a': 'new_a'},
          'return_native': True,
          'verbose': False},
 'tubular_version': ...}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Create column copies.
- Parameters:
X (DataFrame) – Data to apply mappings to.
- Returns:
X – Transformed input X with the specified columns renamed.
- Return type:
DataFrame
- Raises:
ValueError – if new_column_names values are already present in X.
Examples
```pycon
>>> import polars as pl

>>> transformer = RenameColumnsTransformer(
...     columns="a", new_column_names={"a": "new_a"}
... )  # noqa: E501

>>> test_df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

>>> transformer.transform(test_df)
shape: (3, 2)
┌─────┬───────┐
│ b   ┆ new_a │
│ --- ┆ ---   │
│ i64 ┆ i64   │
╞═════╪═══════╡
│ 4   ┆ 1     │
│ 5   ┆ 2     │
│ 6   ┆ 3     │
└─────┴───────┘
```
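The copy-then-drop behaviour implied by drop_original can be sketched on a dict of lists. This is an illustration only, with a hypothetical helper `rename_columns`; it appends the renamed copies at the end, which matches the column order in the example output, and raises ValueError on a name clash as documented under Raises.

```python
def rename_columns(data, new_column_names, drop_original=True):
    """Copy each column under its new name, optionally dropping the original."""
    clashes = [new for new in new_column_names.values() if new in data]
    if clashes:
        raise ValueError(f"new column names already present in data: {clashes}")
    # Keep every column except the originals being dropped.
    out = {
        name: list(values)
        for name, values in data.items()
        if not (drop_original and name in new_column_names)
    }
    # Renamed copies are appended after the remaining columns.
    for old, new in new_column_names.items():
        out[new] = list(data[old])
    return out


frame = {"a": [1, 2, 3], "b": [4, 5, 6]}
renamed = rename_columns(frame, {"a": "new_a"})
duplicated = rename_columns(frame, {"a": "new_a"}, drop_original=False)
print(list(renamed), list(duplicated))
```

With `drop_original=False` the same logic produces a duplicate of the column under a new name, which is the use case the class description mentions for sending one column down separate pipeline paths.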
- class tubular.SetValueTransformer(columns: ]] | str, value: int | float | str | bool | None, **kwargs: bool)[source]
Bases:
BaseTransformer
Transformer to set value of column(s) to a given value.
This should be used if columns need to be set to a constant value.
- built_from_json
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- Type:
bool
- polars_compatible
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- Type:
bool
- jsonable
class attribute, indicates if transformer supports to/from_json methods
- Type:
bool
- FITS
class attribute, indicates whether transform requires fit to be run first
- Type:
bool
- lazyframe_compatible
class attribute, indicates whether transformer works with lazyframes
- Type:
bool
Examples
```pycon
>>> SetValueTransformer(columns="a", value=1)
SetValueTransformer(columns=['a'], value=1)
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> transformer = SetValueTransformer(columns="a", value=1)
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'SetValueTransformer', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'value': 1}, 'fit': {'is_fitted_': True}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Set columns to value.
- Parameters:
X (DataFrame) – Data to apply mappings to.
- Returns:
X – Transformed input X with columns set to value.
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl

>>> transformer = SetValueTransformer(columns="a", value=1)

>>> test_df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

>>> transformer.transform(test_df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i32 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 1   ┆ 5   │
│ 1   ┆ 6   │
└─────┴─────┘
```
- class tubular.ToDatetimeTransformer(columns: str | list[str], time_format: str | None = None, **kwargs: bool)[source]
Bases:
BaseTransformer
Class to convert specified columns to datetime.
Class simply uses the pd.to_datetime method on the specified columns.
Attributes:
- built_from_json: bool
indicates if transformer was reconstructed from json, which limits its supported functionality to .transform
- polars_compatible: bool
class attribute, indicates whether transformer has been converted to polars/pandas agnostic narwhals framework
- jsonable: bool
class attribute, indicates if transformer supports to/from_json methods
- FITS: bool
class attribute, indicates whether transform requires fit to be run first
- lazyframe_compatible: bool
class attribute, indicates whether transformer works with lazyframes
Example:
```pycon
>>> transformer = ToDatetimeTransformer(
...     columns="a",
...     time_format="%d/%m/%Y",
... )
>>> transformer
ToDatetimeTransformer(columns=['a'], time_format='%d/%m/%Y')

>>> # version will vary for local vs CI, so use ... as generic match
>>> json_dump = transformer.to_json()
>>> json_dump
{'tubular_version': ..., 'classname': 'ToDatetimeTransformer', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'time_format': '%d/%m/%Y'}, 'fit': {'is_fitted_': True}}
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Dump transformer to json dict.
- Returns:
jsonified transformer. Nested dict containing levels for attributes set at init and fit.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> transformer = ToDatetimeTransformer(columns="a", time_format="%d/%m/%Y")

>>> # version will vary for local vs CI, so use ... as generic match
>>> transformer.to_json()
{'tubular_version': ..., 'classname': 'ToDatetimeTransformer', 'init': {'columns': ['a'], 'copy': False, 'verbose': False, 'return_native': True, 'time_format': '%d/%m/%Y'}, 'fit': {'is_fitted_': True}}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Convert specified column to datetime using pd.to_datetime.
- Parameters:
X (DataFrame) – Data with column to transform.
- Returns:
dataframe with provided columns converted to datetime
- Return type:
DataFrame
Examples
```pycon
>>> import polars as pl

>>> transformer = ToDatetimeTransformer(
...     columns="a",
...     time_format="%d/%m/%Y",
... )

>>> test_df = pl.DataFrame({"a": ["01/02/2020", "10/12/1996"], "b": [1, 2]})

>>> transformer.transform(test_df)
shape: (2, 2)
┌─────────────────────┬─────┐
│ a                   ┆ b   │
│ ---                 ┆ --- │
│ datetime[μs]        ┆ i64 │
╞═════════════════════╪═════╡
│ 2020-02-01 00:00:00 ┆ 1   │
│ 1996-12-10 00:00:00 ┆ 2   │
└─────────────────────┴─────┘
```
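The role of time_format can be illustrated with the standard library alone. This sketch uses `datetime.strptime` as a stand-in for the dataframe-level conversion, so it shows how the `%d/%m/%Y` directive is interpreted rather than how the transformer is implemented; the helper `to_datetime` is hypothetical, and passing nulls through as None is an assumption.

```python
from datetime import datetime


def to_datetime(values, time_format):
    """Parse each string with `time_format`; nulls are passed through as None."""
    return [
        None if value is None else datetime.strptime(value, time_format)
        for value in values
    ]


parsed = to_datetime(["01/02/2020", "10/12/1996"], "%d/%m/%Y")
print(parsed)
```

With a day-first format, "01/02/2020" is read as 1 February 2020, which agrees with the `2020-02-01` value in the example output.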
- class tubular.WhenThenOtherwiseTransformer(columns: ]], when_column: str, then_column: str, **kwargs: bool | None)[source]
Bases:
BaseTransformer
Transformer to apply conditional logic across multiple columns.
This transformer evaluates specified columns against a condition and updates with given values based on the results.
- polars_compatible
Indicates whether transformer has been converted to polars/pandas agnostic narwhals framework.
- Type:
bool
- FITS
Indicates whether transform requires fit to be run first.
- Type:
bool
- jsonable
Indicates if transformer supports to/from_json methods.
- Type:
bool
- lazyframe_compatible
Indicates whether transformer works with lazyframes.
- Type:
bool
Examples
```pycon
>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3],
...         "b": [4, 5, 6],
...         "condition_col": [True, False, True],
...         "update_col": [10, 20, 30],
...     }
... )
>>> transformer = WhenThenOtherwiseTransformer(
...     columns=["a", "b"], when_column="condition_col", then_column="update_col"
... )
>>> transformed_df = transformer.transform(df)
>>> print(transformed_df)
shape: (3, 4)
┌─────┬─────┬───────────────┬────────────┐
│ a   ┆ b   ┆ condition_col ┆ update_col │
│ --- ┆ --- ┆ ---           ┆ ---        │
│ i64 ┆ i64 ┆ bool          ┆ i64        │
╞═════╪═════╪═══════════════╪════════════╡
│ 10  ┆ 10  ┆ true          ┆ 10         │
│ 2   ┆ 5   ┆ false         ┆ 20         │
│ 30  ┆ 30  ┆ true          ┆ 30         │
└─────┴─────┴───────────────┴────────────┘
```
- FITS = False
- jsonable = True
- lazyframe_compatible = True
- polars_compatible = True
- to_json() dict[str, dict[str, Any]][source]
Serialize the transformer to a JSON-compatible dictionary.
- Returns:
JSON representation of the transformer, including init parameters.
- Return type:
dict[str, dict[str, Any]]
Examples
```pycon
>>> from pprint import pprint
>>> transformer = WhenThenOtherwiseTransformer(
...     columns=["a", "b"],
...     when_column="condition_col",
...     then_column="update_col",  # noqa: E501
... )
>>> pprint(transformer.to_json(), sort_dicts=True)
{'classname': 'WhenThenOtherwiseTransformer',
 'fit': {'is_fitted_': True},
 'init': {'columns': ['a', 'b'],
          'copy': False,
          'return_native': True,
          'then_column': 'update_col',
          'verbose': False,
          'when_column': 'condition_col'},
 'tubular_version': ...}
```
- transform(X: DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame) DataFrame | DataFrame | LazyFrame | DataFrame | LazyFrame[source]
Apply conditional logic to transform specified columns.
- Parameters:
X (DataFrame) – DataFrame containing the columns to be transformed.
- Returns:
Transformed DataFrame with updated columns based on conditions.
- Return type:
DataFrame
- Raises:
TypeError – If the when_column is not of type Boolean or if columns have mismatched types.
Examples
```pycon
>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3],
...         "b": [4, 5, 6],
...         "condition_col": [True, False, True],
...         "update_col": [10, 20, 30],
...     }
... )
>>> transformer = WhenThenOtherwiseTransformer(
...     columns=["a", "b"],
...     when_column="condition_col",
...     then_column="update_col",
... )
>>> transformed_df = transformer.transform(df)
>>> print(transformed_df)
shape: (3, 4)
┌─────┬─────┬───────────────┬────────────┐
│ a   ┆ b   ┆ condition_col ┆ update_col │
│ --- ┆ --- ┆ ---           ┆ ---        │
│ i64 ┆ i64 ┆ bool          ┆ i64        │
╞═════╪═════╪═══════════════╪════════════╡
│ 10  ┆ 10  ┆ true          ┆ 10         │
│ 2   ┆ 5   ┆ false         ┆ 20         │
│ 30  ┆ 30  ┆ true          ┆ 30         │
└─────┴─────┴───────────────┴────────────┘
```
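The row-wise rule behind the example is "take then_column where when_column is true, otherwise keep the existing value". A minimal sketch on a dict of lists follows; the helper `when_then_otherwise` is hypothetical, and the boolean check only loosely mirrors the TypeError documented under Raises.

```python
def when_then_otherwise(data, columns, when_column, then_column):
    """For each target column, use the `then` value where the condition holds."""
    if not all(isinstance(flag, bool) for flag in data[when_column]):
        raise TypeError("when_column must contain only boolean values")
    out = {name: list(values) for name, values in data.items()}
    for col in columns:
        out[col] = [
            then if flag else current
            for current, flag, then in zip(data[col], data[when_column], data[then_column])
        ]
    return out


frame = {
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "condition_col": [True, False, True],
    "update_col": [10, 20, 30],
}
updated = when_then_otherwise(frame, ["a", "b"], "condition_col", "update_col")
print(updated["a"], updated["b"])
```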