Using scikit StandardScaler in Pipeline on a subset of Pandas dataframe columns


I want to use sklearn.preprocessing.StandardScaler on a subset of pandas dataframe columns. Outside a pipeline this is trivial:

df[['A', 'B']] = scaler.fit_transform(df[['A', 'B']])

But now assume I have a column 'C' in df of type string, and the following pipeline definition:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('standard', StandardScaler())
])
df_scaled = pipeline.fit_transform(df)

How can I tell StandardScaler to only scale columns A and B?

I'm used to SparkML pipelines where the features to be scaled can be passed to the constructor of the scaler component:

normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)

Note: The feature column contains a sparse vector with all the numerical feature columns created by Spark's VectorAssembler

Best answer


In plain sklearn, you'll need to use FunctionTransformer together with FeatureUnion. That is, your pipeline will look like:

pipeline = Pipeline([
    ('scale_sum', FeatureUnion(...))
])

where, within the feature union, one function applies the standard scaler to some of the columns, and the other passes the remaining columns through untouched.
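As a rough sketch of what that might look like (the selector helpers select_scaled and select_passthrough below are illustrative plain functions, not part of the sklearn API, and the tiny dataframe is just for demonstration):

import pandas as pd
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical selector helpers: each one picks a subset of columns.
def select_scaled(df):
    return df[['A', 'B']]   # columns to standardize

def select_passthrough(df):
    return df[['C']]        # columns to leave untouched

pipeline = Pipeline([
    ('scale_sum', FeatureUnion([
        # Branch 1: select A and B, then standardize them.
        ('scaled', Pipeline([
            ('select', FunctionTransformer(select_scaled, validate=False)),
            ('standard', StandardScaler()),
        ])),
        # Branch 2: pass C through unchanged (validate=False keeps
        # sklearn from rejecting the string column).
        ('rest', FunctionTransformer(select_passthrough, validate=False)),
    ])),
])

df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [10.0, 20.0, 30.0],
                   'C': ['x', 'y', 'z']})
df_scaled = pipeline.fit_transform(df)  # ndarray: scaled A, B plus raw C

One caveat: FeatureUnion stacks the branch outputs into a single NumPy array, so the original column names are lost and the column order of the result follows the order of the branches.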


Using Ibex (which I co-wrote precisely to make sklearn and pandas work better together), you could write it as follows:

from ibex.sklearn.preprocessing import StandardScaler
from ibex import trans

pipeline = (trans(StandardScaler(), in_cols=['A', 'B']) + trans(None, ['c', 'd'])) | <other pipeline steps>
