上采样+下采样+分层抽样
数据生成
import pandas as pd
import numpy as npdata = pd.DataFrame(np.random.random((50, 3)), columns=['a', 'b', 'c'])
data['y'] = np.vstack((np.zeros((6, 1)), np.ones((44, 1))))
过采样/上采样
# 朴素随机过采样 过采样后样本类别的比列为1:1
from imblearn.over_sampling import RandomOverSamplerros = RandomOverSampler(random_state=0)
x, y = ros.fit_sample(data[['a', 'b', 'c']], data['y'])ros = RandomOverSampler(random_state=0)x, y = ros.fit_sample(data[['a', 'b', 'c']], data['y'])
# SMOTE过采样
# SMOTE:对于少数类的样本a,随机选择一个最近邻的样本b,
# 然后从a到b的连线上随机选取一个点c作为新的少数类样本
from imblearn.over_sampling import SMOTEx, y = SMOTE(k_neighbors=5).fit_sample(data[['a', 'b', 'c']], data['y']) # 少数类别的样本数不能少于k_neighbors
# ADASYN
# 关注的是在那些基于K近邻分类器被错误分类的原始样本附近生成新的少数类样本
from imblearn.over_sampling import ADASYNx, y = ADASYN().fit_sample(data[['a', 'b', 'c']], data['y'])
下采样
from imblearn.under_sampling import ClusterCentroidscc = ClusterCentroids(random_state=0)
x, y = cc.fit_sample(data[['a', 'b', 'c']], data['y'])
from imblearn.under_sampling import RandomUnderSamplerrus = RandomUnderSampler(random_state=0)x, y = rus.fit_sample(data[['a', 'b', 'c']], data['y'])
分层抽样
typical = {0: 1, 1: 0.2} # 抽样比例def function(group, typical):name = group.namefrac = typical[name]return group.sample(frac=frac)result = data.groupby('y', group_keys=False).apply(function, typical)
过采样和下采样结合
在之前的SMOTE方法中, 当由边界的样本与其他样本进行过采样差值时, 很容易生成一些噪音数据. 因此, 在过采样之后需要对样本进行清洗. 这样, 第三节中涉及到的TomekLink 与 EditedNearestNeighbours方法就能实现上述的要求. 所以就有了两种结合过采样与下采样的方法:
(i) SMOTETomek
from imblearnbine import SMOTEENNsmote_enn = SMOTEENN(random_state=0)x, y = smote_enn.fit_sample(data[['a', 'b', 'c']], data['y'])
(ii) SMOTEENN
from imblearnbine import SMOTETomeksmote_tomek = SMOTETomek(random_state=0)x, y = smote_tomek.fit_sample(data[['a', 'b', 'c']], data['y'])
更多推荐
上采样+下采样+分层抽样
发布评论