Spaceship Titanic
Classifying the Spaceship Titanic dataset with XGBClassifier
Dataset: Kaggle Spaceship Titanic
Score:
Data preprocessing
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')
import missingno as msn  # missing-value visualization
Load the dataset
test = pd.read_csv('./test.csv')
sample = pd.read_csv('./sample_submission.csv')
train = pd.read_csv('./train.csv')
Inspect the data
print(train.isnull().sum())
print(train.info())
# Visualize missing values
msn.matrix(train)
print(test.isnull().sum())
print(test.info())
# Visualize missing values
msn.matrix(test)
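# Define the scoring function
from sklearn.model_selection import cross_val_score
def get_score(model, X, y):
    # Accuracy on each of the 20 cross-validation folds
    n = cross_val_score(model, X, y, scoring='accuracy', cv=20)
    return n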
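The scoring function is defined but never called in this post. A minimal sketch of how it could be used follows; the synthetic data and the `LogisticRegression` model are assumptions made purely for illustration and are not part of the original workflow.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def get_score(model, X, y):
    # Accuracy on each of the 20 cross-validation folds
    return cross_val_score(model, X, y, scoring='accuracy', cv=20)

# Synthetic two-class data (illustration only, not the Kaggle data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

scores = get_score(LogisticRegression(), X_demo, y_demo)
print(scores.shape)   # one accuracy value per fold
print(scores.mean())  # average accuracy across folds
```

Averaging the per-fold accuracies gives a single number that can be compared between candidate models.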
Missing-value imputation
fill_col = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Name']
# Fill missing categorical values with the mode and missing numeric values with the mean
for s in fill_col:
    if s in train.columns:
        if train[s].dtype == object:
            fill_none = train[s].value_counts().index[0]
        else:
            fill_none = train[s].mean()
        train[s] = train[s].fillna(fill_none)
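The mode/mean rule above can be checked on a tiny table. This is a sketch on made-up values; the `toy` frame and its contents are hypothetical, and `pd.api.types.is_numeric_dtype` is used here instead of the `dtype == object` test only to keep the example independent of how pandas stores strings.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row table with one gap per column
toy = pd.DataFrame({
    'HomePlanet': ['Earth', 'Mars', None, 'Earth'],
    'Age': [20.0, np.nan, 40.0, 30.0],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        fill_value = toy[col].mean()     # numeric column: mean of the known values
    else:
        fill_value = toy[col].mode()[0]  # categorical column: most frequent label
    toy[col] = toy[col].fillna(fill_value)

print(toy)
```

Here the missing `HomePlanet` becomes `Earth` (the most frequent label) and the missing `Age` becomes 30.0 (the mean of 20, 40 and 30).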
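# Encode the boolean columns first, while they still hold True/False
train['CryoSleep'] = train["CryoSleep"].map({False: 0, True: 1})
train['VIP'] = train["VIP"].map({False: 0, True: 1})
train['Transported'] = train["Transported"].map({False: 0, True: 1})
# Label-encode the remaining categorical columns; sorting the labels keeps the codes reproducible
for s in train.columns:
    if train[s].dtype == object:
        df_ob = {label: idx for idx, label in enumerate(sorted(train[s].dropna().unique()))}
        train[s] = train[s].map(df_ob)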
# Test
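# Forward-fill missing values (fillna(method='ffill') is deprecated in recent pandas)
test = test.ffill()
# Encode the boolean columns first, while they still hold True/False
test['CryoSleep'] = test["CryoSleep"].map({False: 0, True: 1})
test['VIP'] = test["VIP"].map({False: 0, True: 1})
# Label-encode the remaining categorical columns; sorting the labels keeps the codes reproducible
for s in test.columns:
    if test[s].dtype == object:
        df_ob = {label: idx for idx, label in enumerate(sorted(test[s].dropna().unique()))}
        test[s] = test[s].map(df_ob)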
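Because the training and test sets are encoded in separate loops, the same label receives the same code in both only if both contain the same set of labels. One possible alternative, sketched below on hypothetical data, is to build the mapping once from the training labels and reuse it for the test set. The frames `train_demo` and `test_demo` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature train/test frames
train_demo = pd.DataFrame({'HomePlanet': ['Mars', 'Earth', 'Europa', 'Earth']})
test_demo = pd.DataFrame({'HomePlanet': ['Europa', 'Mars', 'Mars']})

# Build the label -> code mapping once, from the training labels only
labels = sorted(train_demo['HomePlanet'].unique())
mapping = {label: idx for idx, label in enumerate(labels)}

# Apply the identical mapping to both frames
train_demo['HomePlanet'] = train_demo['HomePlanet'].map(mapping)
test_demo['HomePlanet'] = test_demo['HomePlanet'].map(mapping)

print(mapping)
print(test_demo)
```

A label that appears only in the test set would map to NaN under this scheme, so high-cardinality columns such as `Cabin` or `Name` would need separate handling.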
Model building
y = train['Transported']
X = train.drop(columns = ['Transported'])
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=30)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
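The scaler learns its mean and standard deviation from the training split only and then reuses them for any later data. A small sketch with made-up numbers shows the effect; the arrays `X_fit` and `X_new` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_fit = np.array([[0.0], [2.0], [4.0]])  # plays the role of the training split
X_new = np.array([[2.0], [6.0]])         # plays the role of unseen data

scaler_demo = StandardScaler()
X_fit_scaled = scaler_demo.fit_transform(X_fit)  # learns mean and std, then scales
X_new_scaled = scaler_demo.transform(X_new)      # reuses the learned mean and std

print(scaler_demo.mean_)  # mean learned from X_fit
print(X_new_scaled)
```

Any data passed to a model trained on scaled features, including the Kaggle test set, should go through the same `transform` call first.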
from bayes_opt import BayesianOptimization
import warnings
warnings.filterwarnings("ignore")
from sklearn import metrics
from sklearn.model_selection import cross_val_predict,cross_validate
from xgboost import XGBClassifier
model = XGBClassifier(
    learning_rate=0.01,
    n_estimators=227,       # number of trees
    max_depth=4,            # tree depth
    min_child_weight=1,     # minimum sum of instance weights in a leaf
    gamma=5,                # coefficient on the number of leaves in the penalty term
    subsample=1.0,          # use all samples to build each tree
    colsample_bytree=0.76,  # fraction of features sampled for each tree
    scale_pos_weight=1,     # balances positive and negative classes
    random_state=27,        # random seed
    verbosity=0,
)
model.fit(X_train,y_train)
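# The model was trained on scaled features, so scale the test set the same way before predicting
rf_grid_1_best = model.predict(scaler.transform(test))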
sample['Transported'] = rf_grid_1_best.astype(bool)
sample.to_csv('submission1.csv', index=False)
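The hold-out split created with `train_test_split` is not evaluated anywhere in this post. A minimal sketch of such a check follows. It runs on synthetic data and uses scikit-learn's `GradientBoostingClassifier` as a stand-in so that the example works without xgboost installed; both are assumptions for illustration, not the author's setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] - X_demo[:, 2] > 0).astype(int)

# Same split settings as in the post: 20% held out, random_state=30
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=30
)

clf = GradientBoostingClassifier(random_state=27)  # stand-in for XGBClassifier
clf.fit(X_tr, y_tr)

# Accuracy on rows the model has not seen during training
val_accuracy = accuracy_score(y_val, clf.predict(X_val))
print(val_accuracy)
```

In the workflow above, the equivalent check would be `accuracy_score(y_test, model.predict(X_test))`, which gives a local estimate before submitting.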
Submit the results to get the score