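Notes on 《Web安全之深度学习实战》: Chapter 15, Credit Card Fraud Detection
This chapter uses the Credit Card Fraud Detection dataset to introduce techniques for detecting credit card fraud. Feature extraction consists of standardization, plus undersampling and oversampling applied on top of the standardized data. The classifiers covered are naive Bayes, XGBoost, and a multilayer perceptron (MLP). Compared with other chapters, the main thing to learn here is how to handle imbalanced data with oversampling and undersampling, which is essential knowledge in machine learning.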
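I. Credit Card Fraud
Credit card fraud means deliberately using forged or invalidated credit cards, fraudulently using another person's card to obtain money or goods, or maliciously overdrawing one's own card. The common forms are as follows.
· Fraudulent use of a lost card. A card is usually lost in one of three ways: it goes missing while the issuing bank is mailing it to the cardholder (a card never received), the cardholder loses it through carelessness, or it is stolen.
· Fraudulent application. The fraudster applies for a card with someone else's information or deliberately supplies false information. The most common cases involve forged ID cards and false employer or home addresses.
· Counterfeit cards. Internationally, more than 60% of credit card fraud cases involve counterfeit cards. They are typically committed by organized groups that cover the whole chain: stealing card data, manufacturing fake cards, selling them, and using them. Counterfeiters often use the latest technology to steal genuine card data; some use miniature skimming devices, and others tamper with authorization terminals. Once they hold genuine card data, they mass-produce fake cards and sell them for large profits.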
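II. Dataset
The data comes from the Credit Card Fraud Detection dataset on Kaggle, which records two days of European credit card transactions from September 2013. Of the 284,807 transactions, 492 are fraudulent, so the dataset is extremely imbalanced: fraud accounts for only 0.172% of all transactions. To avoid leaking user privacy, the raw data was anonymized, and each transaction is described by a 28-dimensional vector, V1 to V28. Time records when the transaction occurred, Amount is the transaction amount, and Class is the label, where 1 means fraud and 0 means a normal transaction.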
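To get a feel for how skewed these counts are, the short sketch below works only from the two figures quoted above (284,807 transactions, 492 frauds). The variable names are chosen for this illustration and do not come from the book's code.

```python
# Counts quoted in the dataset description
total_transactions = 284807
fraud_transactions = 492

normal_transactions = total_transactions - fraud_transactions
fraud_rate = fraud_transactions / total_transactions

# A trivial classifier that always answers "normal" is right on every
# normal transaction and wrong on every fraud.
always_normal_accuracy = normal_transactions / total_transactions

print(f"fraud rate: {fraud_rate:.5%}")
print(f"normal transactions per fraud: {normal_transactions / fraud_transactions:.1f}")
print(f"accuracy of always predicting normal: {always_normal_accuracy:.4%}")
```

Because a model that never flags fraud already scores above 99.8% accuracy, precision and recall on the fraud class are the figures worth reading in the results.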
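III. Feature Extraction
(1) Standardization
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def get_feature():
    df = pd.read_csv("../data/fraud/creditcard.csv")
    # Standardize Amount to zero mean and unit variance
    df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
    df = df.drop(['Time', 'Amount'], axis=1)
    y = df['Class']
    features = df.drop(['Class'], axis=1).columns
    x = df[features]
    # Hold out 40% of the samples for testing
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    return x_train, x_test, y_train, y_test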
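As a small illustration of what the standardization step does to the Amount column, the sketch below computes z-scores by hand with NumPy. The function name standardize and the sample amounts are made up for this example; the formula uses the population standard deviation, which is also what StandardScaler uses by default.

```python
import numpy as np

def standardize(values):
    """Return z-scores: (x - mean) / std, using the population std (ddof=0)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Made-up transaction amounts spanning several orders of magnitude
amounts = np.array([1.0, 10.0, 100.0, 1000.0])
norm_amount = standardize(amounts)

# After standardization the column has mean 0 and standard deviation 1
print(norm_amount)
print(norm_amount.mean(), norm_amount.std())
```

Scaling Amount this way puts it on a range comparable to the anonymized V1-V28 features, which matters most for scale-sensitive models such as the MLP.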
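(2) Standardization & Undersampling
np.random.choice randomly draws elements from an integer range or a one-dimensional array and returns the result as an n-dimensional array.
Its signature is:
numpy.random.choice(a, size=None, replace=True, p=None)
Here a is usually the array to draw from; if a is an integer n, the draw is made from the consecutive integers 0 to n-1. size is the number of elements to draw, replace controls whether sampling is done with replacement, and p gives the probability of each element. Typical usage:
# Draw 3 values from 0-4 (with replacement)
>>> np.random.choice(5, 3)
array([0, 3, 4])
# Draw 3 values from 0-4 according to the probabilities in p
>>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
array([3, 3, 0])
# Draw 3 distinct values from 0-4 (without replacement)
>>> np.random.choice(5, 3, replace=False)
array([3, 1, 0])
# Draw 3 distinct values from 0-4 according to the probabilities in p
>>> np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
array([2, 3, 0])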
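Compared with the standardization-only version, the main change is to randomly pick number_fraud samples from the much larger pool of normal samples, that is, as many as there are fraud samples, and merge them with the fraud samples.
The complete undersampling code is as follows: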
import numpy as np

def get_feature_undersampling():
    df = pd.read_csv("../data/fraud/creditcard.csv")
    df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
    df = df.drop(['Time', 'Amount'], axis=1)
    # Row indices of the fraud samples and of the normal samples
    number_fraud = len(df[df.Class == 1])
    fraud_index = np.array(df[df.Class == 1].index)
    normal_index = df[df.Class == 0].index
    # Draw as many normal samples as there are fraud samples, without replacement
    random_choice_index = np.random.choice(normal_index, size=number_fraud, replace=False)
    x_index = np.concatenate([fraud_index, random_choice_index])
    df = df.drop(['Class'], axis=1)
    x = df.iloc[x_index, :]
    # Labels follow the order of x_index: fraud first, then normal
    y = [1] * number_fraud + [0] * number_fraud
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    return x_train, x_test, y_train, y_test
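The same idea can be tried without the Kaggle file. The sketch below is a minimal, self-contained variant that runs on a synthetic DataFrame; the helper name undersample, the seed parameter, and the toy data are assumptions made for this illustration, not part of the book's code.

```python
import numpy as np
import pandas as pd

def undersample(df, label_col="Class", seed=0):
    """Keep every minority (label 1) row plus an equal number of random majority rows."""
    rng = np.random.default_rng(seed)
    minority_idx = df.index[df[label_col] == 1].to_numpy()
    majority_idx = df.index[df[label_col] == 0].to_numpy()
    # Sample without replacement so no normal row is picked twice
    sampled_majority_idx = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep_idx = np.concatenate([minority_idx, sampled_majority_idx])
    # Select by label so features and labels stay aligned row by row
    return df.loc[keep_idx]

# Toy data: 20 fraud rows among 1000 transactions
toy_rng = np.random.default_rng(42)
toy = pd.DataFrame({"V1": toy_rng.normal(size=1000), "Class": [1] * 20 + [0] * 980})

balanced = undersample(toy)
print(balanced["Class"].value_counts())
```

Selecting whole rows with df.loc keeps the Class column attached to its features, so the labels do not have to be rebuilt by hand in a matching order.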
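Another approach is to split the data into training and test sets first and then undersample only the training set, so that the test set keeps the original class distribution.
The complete code for this variant is as follows: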
def get_feature_undersampling_2():
    df = pd.read_csv("../data/fraud/creditcard.csv")
    df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
    df = df.drop(['Time', 'Amount'], axis=1)
    y = df['Class']
    features = df.drop(['Class'], axis=1).columns
    x = df[features]
    # Split first; only the training set is resampled
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    print("raw data")
    print(y_train.value_counts())
    number_fraud = len(y_train[y_train == 1])
    print(number_fraud)
    fraud_index = np.array(y_train[y_train == 1].index)
    print(fraud_index)
    normal_index = y_train[y_train == 0].index
    random_choice_index = np.random.choice(normal_index, size=number_fraud, replace=False)
    x_index = np.concatenate([fraud_index, random_choice_index])
    print(x_index)
    # The index values are row labels, so select by label from the training set
    x_train_1 = x_train.loc[x_index]
    y_train_1 = [1] * number_fraud + [0] * number_fraud
    return x_train_1, x_test, y_train_1, y_test
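(3) Standardization & Oversampling
Another way to deal with the imbalance between fraud and normal samples is oversampling. In contrast to undersampling, which "robs the rich to help the poor", oversampling keeps every sample of the majority class and uses an algorithm to generate new samples from the minority class. In this example, the normal samples are kept and new fraud samples are generated from the existing ones, so the final dataset is again balanced. The most common generation algorithm is SMOTE.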
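The core idea behind SMOTE is interpolation: pick a minority sample, pick one of its nearest minority-class neighbors, and place a new point somewhere on the line segment between them. The NumPy sketch below illustrates only that idea on made-up points; the function name smote_like_samples and its parameters are hypothetical, and it is not the imbalanced-learn implementation.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)
    # Pairwise squared distances; a point must not be its own neighbor
    diffs = minority[:, None, :] - minority[None, :, :]
    dist = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # For each new point: a random base sample and one of its k nearest neighbors
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    # Random position along the segment from the base sample to the neighbor
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[chosen] - minority[base])

# Six made-up fraud samples with two features each
fraud_points = np.array([[0.0, 0.0], [1.0, 0.2], [0.8, 1.0],
                         [0.1, 0.9], [0.5, 0.5], [0.3, 0.2]])
synthetic = smote_like_samples(fraud_points, n_new=10, k=3)
print(synthetic)
```

Every synthetic point lies between two real fraud samples, so the minority class grows without exact duplicates of existing rows.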
from imblearn.over_sampling import SMOTE

def get_feature_upsampling():
    df = pd.read_csv("../data/fraud/creditcard.csv")
    df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
    df = df.drop(['Time', 'Amount'], axis=1)
    y = df['Class']
    features = df.drop(['Class'], axis=1).columns
    x = df[features]
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    print("raw data")
    print(y_train.value_counts())
    # Generate new fraud samples on the training set only
    smote = SMOTE(random_state=0)
    x_train_1, y_train_1 = smote.fit_resample(x_train, y_train)
    print("Smote data")
    print(pd.Series(y_train_1).value_counts())
    # Return the resampled training set together with the untouched test set
    return x_train_1, x_test, y_train_1, y_test
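Compared with the standardization-only version, the only change is the SMOTE resampling applied to the training set after the split.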
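IV. Model Construction
(1) Naive Bayes (NB)
from sklearn.naive_bayes import GaussianNB

def do_nb(x_train, x_test, y_train, y_test):
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    # do_metrics prints the evaluation metrics; its definition is not shown in these notes
    do_metrics(y_test, y_pred)
(2) XGBoost
import xgboost as xgb

def do_xgboost(x_train, x_test, y_train, y_test):
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    do_metrics(y_test, y_pred)
(3) MLP
from sklearn.neural_network import MLPClassifier

def do_mlp(x_train, x_test, y_train, y_test):
    # MLP with two hidden layers of 5 and 2 units, trained with L-BFGS
    clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    do_metrics(y_test, y_pred)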
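All three model functions call do_metrics, whose definition does not appear in these notes. Judging from the labels in the printed results, it probably reports accuracy, the confusion matrix, precision, recall, and F1 through sklearn.metrics. The version below is a guess at such a helper, not the book's code; returning the values in a dict is an addition made here so they can be inspected.

```python
from sklearn import metrics

def do_metrics(y_test, y_pred):
    """Print common binary-classification metrics and return them in a dict."""
    results = {
        "accuracy_score": metrics.accuracy_score(y_test, y_pred),
        "confusion_matrix": metrics.confusion_matrix(y_test, y_pred),
        "precision_score": metrics.precision_score(y_test, y_pred),
        "recall_score": metrics.recall_score(y_test, y_pred),
        "f1_score": metrics.f1_score(y_test, y_pred),
    }
    for name, value in results.items():
        print(f"metrics.{name}:")
        print(value)
    return results

# Tiny made-up example: 3 normal and 3 fraud samples, one error in each class
results = do_metrics([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 0])
```

With labels 0 and 1, the confusion matrix returned by scikit-learn is laid out as [[TN, FP], [FN, TP]], which is how the matrices in the results section should be read.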
V. Results
(1) Standardization
XGBoost
metrics.accuracy_score:
0.9995084399111681
metrics.confusion_matrix:
[[113700     14]
 [    42    167]]
metrics.precision_score:
0.9226519337016574
metrics.recall_score:
0.7990430622009569
metrics.f1_score:
0.8564102564102563
mlp
metrics.accuracy_score:
0.9994118834651475
metrics.confusion_matrix:
[[113701     13]
 [    54    155]]
metrics.precision_score:
0.9226190476190477
metrics.recall_score:
0.7416267942583732
metrics.f1_score:
0.8222811671087533
nb
metrics.accuracy_score:
0.9787926933103939
metrics.confusion_matrix:
[[111334   2380]
 [    36    173]]
metrics.precision_score:
0.06776341558950255
metrics.recall_score:
0.8277511961722488
metrics.f1_score:
0.12527154236060825
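The naive Bayes figures for the standardization run show why accuracy alone is misleading on imbalanced data: accuracy is about 0.979 while precision is below 0.07. As a rough check, the metrics can be recomputed by hand from the confusion matrix. The helper name metrics_from_confusion is made up for this illustration, and the layout [[TN, FP], [FN, TP]] assumes scikit-learn's convention for labels 0 and 1.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)   # share of flagged transactions that are real fraud
    recall = tp / (tp + fn)      # share of real fraud that gets flagged
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Naive Bayes confusion matrix from the standardization run: [[111334, 2380], [36, 173]]
nb = metrics_from_confusion(tn=111334, fp=2380, fn=36, tp=173)
for name, value in nb.items():
    print(f"{name}: {value:.4f}")
```

The 2,380 false positives hardly move accuracy because they are tiny next to 111,334 true negatives, but they swamp the 173 true positives and pull precision down.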
(2) Undersampling
XGBoost
metrics.accuracy_score:
0.9137055837563451
metrics.confusion_matrix:
[[190  10]
 [ 24 170]]
metrics.precision_score:
0.9444444444444444
metrics.recall_score:
0.8762886597938144
metrics.f1_score:
0.9090909090909091
mlp
metrics.accuracy_score:
0.9187817258883249
metrics.confusion_matrix:
[[187  13]
 [ 19 175]]
metrics.precision_score:
0.9308510638297872
metrics.recall_score:
0.9020618556701031
metrics.f1_score:
0.9162303664921466
nb
metrics.accuracy_score:
0.9010152284263959
metrics.confusion_matrix:
[[193   7]
 [ 32 162]]
metrics.precision_score:
0.9585798816568047
metrics.recall_score:
0.8350515463917526
metrics.f1_score:
0.8925619834710744
(3) Oversampling
raw data
0    170576
1       308
Name: Class, dtype: int64
Smote data
1    170576
0    170576
Name: Class, dtype: int64
XGBoost
metrics.accuracy_score:
0.9996401077921052
metrics.confusion_matrix:
[[113731      8]
 [    33    151]]
metrics.precision_score:
0.949685534591195
metrics.recall_score:
0.8206521739130435
metrics.f1_score:
0.880466472303207
mlp
metrics.accuracy_score:
0.9993943277476892
metrics.confusion_matrix:
[[113698     41]
 [    28    156]]
metrics.precision_score:
0.7918781725888325
metrics.recall_score:
0.8478260869565217
metrics.f1_score:
0.8188976377952757
nb
metrics.accuracy_score:
0.9790911405071847
metrics.confusion_matrix:
[[111382   2357]
 [    25    159]]
metrics.precision_score:
0.06319554848966613
metrics.recall_score:
0.8641304347826086
metrics.f1_score:
0.11777777777777776