Using scikit StandardScaler in Pipeline on a subset of Pandas dataframe columns


I want to use sklearn.preprocessing.StandardScaler on a subset of pandas dataframe columns. Outside a pipeline this is trivial:

df[['A', 'B']] = scaler.fit_transform(df[['A', 'B']])

But now assume I have a column 'C' in df of type string, and the following pipeline definition:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('standard', StandardScaler())
])
df_scaled = pipeline.fit_transform(df)

How can I tell StandardScaler to only scale columns A and B?

I'm used to SparkML pipelines where the features to be scaled can be passed to the constructor of the scaler component:

normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)

Note: The feature column contains a sparse vector with all the numerical feature columns created by Spark's VectorAssembler

Best answer


In plain sklearn, you'll need to use FunctionTransformer together with FeatureUnion. That is, your pipeline will look like:

pipeline = Pipeline([
    ('scale_sum', FeatureUnion(...))
])

where, within the feature union, one function applies the standard scaler to some of the columns, and the other passes the remaining columns through untouched.
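As a rough sketch of what that might look like (the selector helpers select_scaled and select_passthrough below are illustrative plain functions, not part of the sklearn API, and the tiny dataframe is just for demonstration):

import pandas as pd
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical selector helpers: each one picks a subset of columns.
def select_scaled(df):
    return df[['A', 'B']]   # columns to standardize

def select_passthrough(df):
    return df[['C']]        # columns to leave untouched

pipeline = Pipeline([
    ('scale_sum', FeatureUnion([
        # Branch 1: select A and B, then standardize them.
        ('scaled', Pipeline([
            ('select', FunctionTransformer(select_scaled, validate=False)),
            ('standard', StandardScaler()),
        ])),
        # Branch 2: pass C through unchanged (validate=False keeps
        # sklearn from rejecting the string column).
        ('rest', FunctionTransformer(select_passthrough, validate=False)),
    ])),
])

df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [10.0, 20.0, 30.0],
                   'C': ['x', 'y', 'z']})
df_scaled = pipeline.fit_transform(df)  # ndarray: scaled A, B plus raw C

One caveat: FeatureUnion stacks the branch outputs into a single NumPy array, so the original column names are lost and the column order of the result follows the order of the branches.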


Using Ibex (which I co-wrote precisely to make sklearn and pandas work better together), you could write it as follows:

from ibex.sklearn.preprocessing import StandardScaler
from ibex import trans

pipeline = (trans(StandardScaler(), in_cols=['A', 'B']) + trans(None, ['c', 'd'])) | <other pipeline steps>
