Notes on "Web Security: Deep Learning in Practice" (《Web安全之深度学习实战》): Chapter 15, Anti Credit Card Fraud

Updated: 2024-10-25 12:18:55


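This chapter uses the Credit Card Fraud Detection dataset as an example to introduce techniques for detecting credit card fraud. Feature extraction consists of standardization, plus undersampling and oversampling applied on top of the standardized data. The classifiers covered are naive Bayes, XGBoost, and the multilayer perceptron (MLP). Compared with other chapters, the main takeaway here is how to apply undersampling and oversampling, which are important techniques for handling imbalanced data in machine learning.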

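1. Credit Card Fraud

        Credit card fraud means deliberately using forged or invalidated credit cards, fraudulently using another person's credit card to obtain money or goods, or maliciously overdrawing one's own credit card. The common forms are as follows.
        · Fraudulent use of lost cards. A card is usually lost in one of three ways: it goes missing while the issuing bank mails it to the cardholder (a card that never arrives); the cardholder loses it through careless safekeeping; or it is stolen by criminals.
        · Fraudulent applications. These typically use another person's information to apply for a card, or deliberately supply false information. The most common cases involve forged identity cards and false employer or home addresses.
        · Counterfeit cards. Internationally, more than 60% of credit card fraud cases involve counterfeit cards. They are typically committed by organized groups that cover the whole chain: stealing card data, manufacturing fake cards, selling them, and using them to commit fraud. Counterfeiters often use the latest technology to steal genuine card data: some use miniature skimming devices, others tamper with the functions of authorization terminals when the opportunity arises. Once genuine card data has been stolen, fake cards are produced in bulk and sold on for large-scale fraud and large illegal profits.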

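2. Dataset

The test data comes from the Credit Card Fraud Detection dataset on Kaggle. It records credit card transactions made in Europe in September 2013 and covers two days of transactions. Of the 284807 transactions, 492 are fraudulent. The dataset is therefore extremely imbalanced: fraud accounts for only 0.172% of all transactions. To avoid leaking user privacy, the original data was anonymized (the published features are PCA components), and each transaction is described by a 28-dimensional vector, V1 to V28. Time is when the transaction occurred (in the Kaggle data, seconds elapsed since the first transaction), Amount is the transaction amount, and Class marks whether the transaction is fraudulent: 1 means fraud and 0 means a normal transaction.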
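To make the imbalance concrete, the short calculation below uses only the two counts quoted above. It also shows why accuracy alone is a weak metric here: a hypothetical classifier that labels every transaction as normal would already score about 99.8% accuracy while catching no fraud at all. The variable names are illustrative.

```python
# Counts quoted from the dataset description
total_transactions = 284807
fraud_transactions = 492

fraud_ratio = fraud_transactions / total_transactions

# Accuracy of a hypothetical classifier that predicts "normal" for everything
all_normal_accuracy = (total_transactions - fraud_transactions) / total_transactions

print(f"fraud ratio: {fraud_ratio:.4%}")
print(f"accuracy of always predicting normal: {all_normal_accuracy:.4%}")
```

This is why the results in this chapter are better compared on precision, recall, and F1 than on accuracy.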

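3. Feature Extraction

(1) Standardization

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def get_feature():
    df = pd.read_csv("../data/fraud/creditcard.csv")
    # Standardize Amount to zero mean and unit variance
    df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
    # Drop the raw Time and Amount columns
    df = df.drop(['Time', 'Amount'], axis=1)
    y = df['Class']
    features = df.drop(['Class'], axis=1).columns
    x = df[features]
    # Hold out 40% of the data as the test set
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    return x_train, x_test, y_train, y_test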
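As a rough illustration of what the standardization step does to Amount, the sketch below applies the same z-score formula by hand with NumPy. The amounts are made-up values, not taken from the dataset.

```python
import numpy as np

# Hypothetical transaction amounts, for illustration only
amounts = np.array([1.0, 10.0, 100.0, 1000.0])

# z-score: subtract the mean, divide by the population standard deviation
# (ddof=0), which is the convention StandardScaler follows
norm_amounts = (amounts - amounts.mean()) / amounts.std()

print(norm_amounts)
print(norm_amounts.mean(), norm_amounts.std())
```

After the transform the column has mean 0 and standard deviation 1 (up to floating point error). A stricter variant, not used in these notes, would fit the scaler on the training split only and reuse it on the test split.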

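(2) Standardization & Undersampling

np.random.choice randomly draws elements from an integer range or a one-dimensional array and returns the selection as an ndarray.

The function prototype is:

numpy.random.choice(a, size=None, replace=True, p=None)

Here a is usually the array to sample from; if it is an integer, it is treated as the range np.arange(a). size is the number of elements to draw, replace controls whether the same element may be drawn more than once, and p gives the probability of each element. Typical usage is shown below (the outputs are random and will differ from run to run):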

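# Randomly pick 3 values from 0 to 4, i.e. from np.arange(5)
>>> np.random.choice(5, 3)
array([0, 3, 4])
# Pick 3 values from 0 to 4 according to the probability table p
>>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
array([3, 3, 0])
# Pick 3 distinct values (sampling without replacement)
>>> np.random.choice(5, 3, replace=False)
array([3, 1, 0])
# Without replacement, according to the probability table p
>>> np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
array([2, 3, 0])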
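For undersampling, the relevant combination is an array of row labels plus replace=False, so that no row is picked twice. The sketch below shows that pattern on a hypothetical pool of labels; the seed is arbitrary and only makes the run repeatable.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, chosen only to make the run repeatable

# Hypothetical pool of "normal" row labels
normal_index = np.arange(100, 120)

# Draw 5 distinct labels from the pool
picked = np.random.choice(normal_index, size=5, replace=False)
print(picked)
```

With replace=True (the default) the same label could appear more than once, which would duplicate rows in the sampled set.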

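Compared with the standardization-only version, the change is to randomly select, from the much larger set of normal (white) samples, as many samples as there are fraud (black) samples, number_fraud in total, and then merge them with the fraud samples.

The complete undersampling code is as follows: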

import numpy as np

def get_feature_undersampling():
    df = pd.read_csv("../data/fraud/creditcard.csv")
    df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
    df = df.drop(['Time', 'Amount'], axis=1)
    # Row labels of the fraud samples and of the normal samples
    number_fraud = len(df[df.Class == 1])
    fraud_index = np.array(df[df.Class == 1].index)
    normal_index = df[df.Class == 0].index
    # Keep as many randomly chosen normal samples as there are fraud samples
    random_choice_index = np.random.choice(normal_index, size=number_fraud, replace=False)
    x_index = np.concatenate([fraud_index, random_choice_index])
    df = df.drop(['Class'], axis=1)
    # x_index holds index labels, so select with .loc
    x = df.loc[x_index, :]
    # Labels follow the concatenation order: fraud first, then normal
    y = [1] * number_fraud + [0] * number_fraud
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    return x_train, x_test, y_train, y_test
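The index handling in the function above can be checked without the CSV file. The following is a minimal sketch on synthetic labels (1000 normal, 10 fraud, made up for illustration) that follows the same steps: find both index sets, sample the majority class without replacement, and concatenate.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for repeatability

# Synthetic labels: 1000 normal (0) and 10 fraud (1) samples, illustration only
y = np.array([0] * 1000 + [1] * 10)

fraud_index = np.flatnonzero(y == 1)
normal_index = np.flatnonzero(y == 0)

# Keep every fraud sample and an equally sized random subset of normal samples
kept_normal = rng.choice(normal_index, size=len(fraud_index), replace=False)
balanced_index = np.concatenate([fraud_index, kept_normal])

print(np.bincount(y[balanced_index]))  # [10 10]
```

The price of the balance is that 990 of the 1000 normal samples are thrown away, which is the main drawback of undersampling.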

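Another approach is to split the data into training and test sets first and then undersample only the training set, so that the test set keeps the original class distribution.

The complete code for this variant is as follows: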

def get_feature_undersampling_2():
    df = pd.read_csv("../data/fraud/creditcard.csv")
    df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
    df = df.drop(['Time', 'Amount'], axis=1)
    y = df['Class']
    features = df.drop(['Class'], axis=1).columns
    x = df[features]
    # Split first so that the test set keeps the original class distribution
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    print("raw data")
    print(y_train.value_counts())
    # Undersample the training set only
    number_fraud = len(y_train[y_train == 1])
    print(number_fraud)
    fraud_index = np.array(y_train[y_train == 1].index)
    print(fraud_index)
    normal_index = y_train[y_train == 0].index
    random_choice_index = np.random.choice(normal_index, size=number_fraud, replace=False)
    x_index = np.concatenate([fraud_index, random_choice_index])
    print(x_index)
    # x_index holds index labels, so select with .loc
    x_train_1 = x_train.loc[x_index, :]
    y_train_1 = [1] * number_fraud + [0] * number_fraud
    return x_train_1, x_test, y_train_1, y_test

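(3) Standardization & Oversampling

Another way to deal with the imbalance between fraud and normal samples is oversampling. In contrast to undersampling, which cuts the majority class down to the size of the minority class, oversampling keeps all samples of the majority class and uses an algorithm to generate new samples from the minority class. In this example the normal samples are kept as they are, and new fraud samples are generated from the existing ones until the two classes are balanced. The most common generation algorithm is SMOTE (Synthetic Minority Over-sampling Technique).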

from imblearn.over_sampling import SMOTE

def get_feature_upsampling():
    df = pd.read_csv("../data/fraud/creditcard.csv")
    df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
    df = df.drop(['Time', 'Amount'], axis=1)
    y = df['Class']
    features = df.drop(['Class'], axis=1).columns
    x = df[features]
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    print("raw data")
    print(y_train.value_counts())
    # Oversample the training set only; the test set is left untouched
    smote = SMOTE(random_state=0)
    x_train_1, y_train_1 = smote.fit_resample(x_train, y_train)
    print("Smote data")
    print(pd.Series(y_train_1).value_counts())
    # Return the resampled training data together with the original test data
    return x_train_1, x_test, y_train_1, y_test

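Compared with the standardization-only version, the only change is that SMOTE is applied to the training set after the split.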
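The core idea behind SMOTE is interpolation: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. The function below is a simplified sketch of that idea written with NumPy. It is not the imblearn implementation, and the name smote_sketch, the parameter k, and the sample points are all illustrative.

```python
import numpy as np

def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Distances from sample i to every minority sample
        dist = np.linalg.norm(minority - minority[i], axis=1)
        # Indices of the k nearest neighbours, skipping the sample itself
        neighbours = np.argsort(dist)[1:k + 1]
        j = rng.choice(neighbours)
        # The new point lies on the segment between sample i and neighbour j
        gap = rng.random()
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

# Hypothetical 2-D fraud samples, for illustration only
fraud = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_fraud = smote_sketch(fraud, n_new=6)
print(new_fraud.shape)  # (6, 2)
```

Because every synthetic point is a convex combination of two real fraud samples, the new points stay inside the region already occupied by the minority class.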

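4. Model Construction

(1) NB

from sklearn.naive_bayes import GaussianNB

def do_nb(x_train, x_test, y_train, y_test):
    # Gaussian naive Bayes with default parameters
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    do_metrics(y_test, y_pred)

(2) XGBoost

import xgboost as xgb

def do_xgboost(x_train, x_test, y_train, y_test):
    # XGBoost classifier with default parameters
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    do_metrics(y_test, y_pred)

(3) MLP

from sklearn.neural_network import MLPClassifier

def do_mlp(x_train, x_test, y_train, y_test):
    # MLP with two hidden layers of 5 and 2 units
    clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    do_metrics(y_test, y_pred)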
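The do_metrics helper called by all three functions is not shown in these notes. Judging by the labels in the printed results (metrics.accuracy_score and so on), the original presumably wraps sklearn.metrics. The version below is a dependency-free stand-in for binary 0/1 labels that prints the same labels. It is an assumption about the helper's behaviour, not the author's code.

```python
def do_metrics(y_test, y_pred):
    """Hypothetical stand-in for the do_metrics helper (binary labels 0/1)."""
    pairs = list(zip(y_test, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
    confusion = [[tn, fp], [fn, tp]]

    results = {
        "accuracy_score": accuracy,
        "confusion_matrix": confusion,
        "precision_score": precision,
        "recall_score": recall,
        "f1_score": f1,
    }
    for name, value in results.items():
        print(f"metrics.{name}:")
        print(value)
    return results

# Toy labels, for illustration only
result = do_metrics([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 0])
```

In the confusion matrix the first row counts normal transactions and the second row counts fraud, so the bottom-left entry is the number of missed frauds.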

5. Results

(1) Standardization

XGBoost
metrics.accuracy_score:
0.9995084399111681
metrics.confusion_matrix:
[[113700     14]
 [    42    167]]
metrics.precision_score:
0.9226519337016574
metrics.recall_score:
0.7990430622009569
metrics.f1_score:
0.8564102564102563
mlp
metrics.accuracy_score:
0.9994118834651475
metrics.confusion_matrix:
[[113701     13]
 [    54    155]]
metrics.precision_score:
0.9226190476190477
metrics.recall_score:
0.7416267942583732
metrics.f1_score:
0.8222811671087533
nb
metrics.accuracy_score:
0.9787926933103939
metrics.confusion_matrix:
[[111334   2380]
 [    36    173]]
metrics.precision_score:
0.06776341558950255
metrics.recall_score:
0.8277511961722488
metrics.f1_score:
0.12527154236060825

(2) Undersampling

These figures are measured on the balanced test set returned by get_feature_undersampling (394 samples), so they are not directly comparable with the other two experiments, whose test sets keep the original class ratio.

XGBoost
metrics.accuracy_score:
0.9137055837563451
metrics.confusion_matrix:
[[190  10]
 [ 24 170]]
metrics.precision_score:
0.9444444444444444
metrics.recall_score:
0.8762886597938144
metrics.f1_score:
0.9090909090909091
mlp
metrics.accuracy_score:
0.9187817258883249
metrics.confusion_matrix:
[[187  13]
 [ 19 175]]
metrics.precision_score:
0.9308510638297872
metrics.recall_score:
0.9020618556701031
metrics.f1_score:
0.9162303664921466
nb
metrics.accuracy_score:
0.9010152284263959
metrics.confusion_matrix:
[[193   7]
 [ 32 162]]
metrics.precision_score:
0.9585798816568047
metrics.recall_score:
0.8350515463917526
metrics.f1_score:
0.8925619834710744
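The high precision in the undersampling run depends on the test set being balanced. As a rough back-of-the-envelope check, the sketch below takes the XGBoost confusion matrix from the undersampling run ([[190, 10], [24, 170]]) and projects it onto the class counts of the standardization-only test split (113714 normal, 209 fraud). It assumes that recall and the false positive rate would carry over unchanged, which is only an approximation.

```python
# XGBoost confusion matrix on the balanced (undersampled) test set
tn, fp, fn, tp = 190, 10, 24, 170

precision_balanced = tp / (tp + fp)
recall = tp / (tp + fn)
false_positive_rate = fp / (fp + tn)

# Class counts of the test split in the standardization-only run
normal_count, fraud_count = 113714, 209

# Assumption: recall and false positive rate carry over unchanged
expected_tp = recall * fraud_count
expected_fp = false_positive_rate * normal_count
precision_projected = expected_tp / (expected_tp + expected_fp)

print(f"precision on balanced test set: {precision_balanced:.3f}")
print(f"projected precision at original class ratio: {precision_projected:.3f}")
```

Under that assumption a 5% false positive rate, harmless on 200 normal samples, would produce thousands of false alarms on the full test split and pull precision down to roughly 3%. This is a projection, not a measured result.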

(3) Oversampling

raw data
0    170576
1       308
Name: Class, dtype: int64
Smote data
1    170576
0    170576
Name: Class, dtype: int64
XGBoost
metrics.accuracy_score:
0.9996401077921052
metrics.confusion_matrix:
[[113731      8]
 [    33    151]]
metrics.precision_score:
0.949685534591195
metrics.recall_score:
0.8206521739130435
metrics.f1_score:
0.880466472303207
mlp
metrics.accuracy_score:
0.9993943277476892
metrics.confusion_matrix:
[[113698     41]
 [    28    156]]
metrics.precision_score:
0.7918781725888325
metrics.recall_score:
0.8478260869565217
metrics.f1_score:
0.8188976377952757
nb
metrics.accuracy_score:
0.9790911405071847
metrics.confusion_matrix:
[[111382   2357]
 [    25    159]]
metrics.precision_score:
0.06319554848966613
metrics.recall_score:
0.8641304347826086
metrics.f1_score:
0.11777777777777776


Published: 2024-02-26 13:18:49

Article link: https://www.elefans.com/category/jswz/34/1702617.html