I want to use sklearn.preprocessing.StandardScaler on a subset of pandas dataframe columns. Outside a pipeline this is trivial:
    df[['A', 'B']] = scaler.fit_transform(df[['A', 'B']])

But now assume I have a column 'C' of type string in df, and the following pipeline definition:
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline

    pipeline = Pipeline([
        ('standard', StandardScaler())
    ])
    df_scaled = pipeline.fit_transform(df)

How can I tell StandardScaler to only scale columns A and B?
I'm used to SparkML pipelines where the features to be scaled can be passed to the constructor of the scaler component:
    normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)

Note: The features column contains a sparse vector with all the numerical feature columns, created by Spark's VectorAssembler.
Best answer
In plain sklearn, you'll need to use FunctionTransformer together with FeatureUnion. That is, your pipeline will look like:
    pipeline = Pipeline([
        ('scale_sum', FeatureUnion(...))
    ])

where, within the feature union, one function applies the standard scaler to some of the columns and the other passes the remaining columns through untouched.
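The FunctionTransformer and FeatureUnion combination described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the example data and the helper names select_numeric and select_rest are hypothetical and not part of the original answer, and the column names are taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical example data: two numeric columns and one string column.
df = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0],
    'B': [10.0, 20.0, 30.0, 40.0],
    'C': ['w', 'x', 'y', 'z'],
})


def select_numeric(X):
    """Hypothetical helper: keep only the columns that should be scaled."""
    return X[['A', 'B']]


def select_rest(X):
    """Hypothetical helper: keep the columns that pass through unchanged."""
    return X[['C']]


pipeline = Pipeline([
    ('scale_sum', FeatureUnion([
        # Branch 1: select A and B, then standardize them.
        ('scaled', make_pipeline(
            FunctionTransformer(select_numeric, validate=False),
            StandardScaler(),
        )),
        # Branch 2: select C and pass it through untouched.
        # validate=False keeps the string column from being rejected.
        ('untouched', FunctionTransformer(select_rest, validate=False)),
    ])),
])

# FeatureUnion stacks the branch outputs side by side; because one branch
# holds strings, the result is a NumPy array of dtype object.
result = pipeline.fit_transform(df)

# Optionally wrap the result back into a DataFrame with the original labels.
df_scaled = pd.DataFrame(result, columns=['A', 'B', 'C'], index=df.index)
```

Note that the output columns follow the order of the branches in the union (scaled columns first, then the untouched ones), not the column order of the input frame, so the labels have to be reattached by hand as in the last line.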
Using Ibex (which I co-wrote exactly to make sklearn and pandas work better), you could write it as follows:
    from ibex.sklearn.preprocessing import StandardScaler
    from ibex import trans

    pipeline = (trans(StandardScaler(), in_cols=['A', 'B']) + trans(None, ['c', 'd'])) | <other pipeline steps>