I'm working on a sklearn homework assignment and I don't understand why one should standardize and normalize the test data with the training mean and standard deviation. How can I implement this in Python? Here is my implementation for the train data:
import sklearn.datasets
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

digits = sklearn.datasets.load_digits()
X = digits.data
Y = digits.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, train_size=0.7)
std_scale = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)
# X_test_std = ??

For the train set I think this is correct, but what about the test set?
Best answer
Why?
Because your classifier/regressor will be trained on those standardized values. You don't want your trained classifier to make predictions on data that was scaled with different statistics.
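To make the pitfall concrete, here is a small sketch (not from the original post; the toy data and variable names are my own) comparing test data scaled with the training statistics against test data scaled with its own statistics. Even when both sets come from the same distribution, their sample means and standard deviations differ, so the two scalings disagree:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: train and test drawn from the same distribution,
# but their sample statistics differ slightly.
rng = np.random.RandomState(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 1))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 1))

scaler = StandardScaler().fit(X_train)

# Correct: scale the test data with the *training* mean/std.
X_test_good = scaler.transform(X_test)

# Wrong: scale the test data with its own mean/std -- the classifier
# would then see inputs on a different scale than it was trained on.
X_test_bad = StandardScaler().fit_transform(X_test)

# The two results differ, which is exactly the mismatch to avoid.
print(np.allclose(X_test_good, X_test_bad))
```

Since the training and test samples never have identical statistics in practice, refitting the scaler on the test set silently shifts and rescales the inputs relative to what the model learned.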
How:
std_scale = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)
X_test_std = std_scale.transform(X_test)

Fit once, then transform whatever you need to transform. That's the advantage of the class-based StandardScaler (which you had already chosen) over scale, which does not keep the statistics obtained during fit and therefore cannot apply the same transformation at a later time.
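A minimal sketch of what the fitted object stores: after fit(), StandardScaler keeps the training statistics in its mean_ and scale_ attributes, and transform() simply applies (X - mean_) / scale_ to any array, including the test set. (The random_state below is my own choice for reproducibility, not from the original post.)

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test = train_test_split(digits.data, test_size=0.3, random_state=0)

std_scale = preprocessing.StandardScaler().fit(X_train)
X_test_std = std_scale.transform(X_test)

# transform() reuses the statistics captured during fit():
manual = (X_test - std_scale.mean_) / std_scale.scale_
print(np.allclose(X_test_std, manual))
```

This is why one fit call is enough: the scaler is a container for the training mean and standard deviation, and every later transform call, on train or test data, reuses exactly those numbers.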