Titanic Passenger Survival Prediction
1. Task
Predict the survival of Titanic passengers.
2. Data Description
Variable   Definition                                   Key
survival   Survival                                     0 = No, 1 = Yes
pclass     Ticket class                                 1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex
age        Age in years
sibsp      # of siblings / spouses aboard the Titanic
parch      # of parents / children aboard the Titanic
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of Embarkation                          C = Cherbourg, Q = Queenstown, S = Southampton
Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
3. Implementation
3.1 Data Overview
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print(train.info())
print(test.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
None
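The non-null counts above imply missing values in Age, Cabin, and Embarked. The same information can be read off directly with isnull().sum(); a minimal sketch on a toy frame (hypothetical values standing in for train.csv):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv (hypothetical values).
df = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
    'Pclass': [3, 1, 1, 2],
})

# isnull().sum() reports the number of missing entries per column,
# which is what the non-null counts in info() imply.
print(df.isnull().sum())
```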
3.2 Manually Selecting Features Useful for Prediction
selected_feature = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked']
X_train = train[selected_feature]
X_test = test[selected_feature]
y_train = train['Survived']
print(X_train['Embarked'].value_counts())
print(X_test['Embarked'].value_counts())
Output:
S 644
C 168
Q 77
Name: Embarked, dtype: int64
S 270
C 102
Q 46
Name: Embarked, dtype: int64
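The counts show 'S' dominates both splits, which motivates the fill value used in the next step. The mode can also be derived programmatically instead of hard-coded; a minimal sketch on a toy Series (hypothetical values):

```python
import pandas as pd

# Toy Embarked column (hypothetical values); 'S' is the most frequent port.
embarked = pd.Series(['S', 'C', 'S', 'Q', 'S', 'C'])

# value_counts() sorts labels by frequency, so idxmax() returns the mode,
# avoiding a hard-coded 'S' in the fillna call.
most_common = embarked.value_counts().idxmax()
print(most_common)
```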
3.3 Filling Missing Values
# Fill Embarked with its most frequent value and Age with the mean.
# Work on copies so fillna modifies the frames themselves rather than a
# view of train/test (avoids pandas' SettingWithCopyWarning).
X_train = X_train.copy()
X_test = X_test.copy()
X_train['Embarked'] = X_train['Embarked'].fillna('S')
X_test['Embarked'] = X_test['Embarked'].fillna('S')
X_train['Age'] = X_train['Age'].fillna(X_train['Age'].mean())
X_test['Age'] = X_test['Age'].fillna(X_test['Age'].mean())
X_train.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
Pclass 891 non-null int64
Sex 891 non-null object
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Embarked 891 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 41.8+ KB
3.4 Feature Vectorization
# Vectorize the features: DictVectorizer passes numeric fields through and
# one-hot encodes string fields such as Sex and Embarked.
from sklearn.feature_extraction import DictVectorizer

dict_vec = DictVectorizer(sparse=False)
X_train = dict_vec.fit_transform(X_train.to_dict(orient='records'))
# Use transform (not fit_transform) on the test set so it reuses the
# feature mapping learned from the training set.
X_test = dict_vec.transform(X_test.to_dict(orient='records'))
dict_vec.feature_names_
Output:
['Age','Embarked=C','Embarked=Q','Embarked=S','Parch','Pclass','Sex=female','Sex=male','SibSp']
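The feature names show how DictVectorizer expands each string field into one indicator column per observed value while leaving numeric fields untouched. A self-contained sketch on two toy records (hypothetical values):

```python
from sklearn.feature_extraction import DictVectorizer

# Two toy passenger records (hypothetical values).
rows = [
    {'Pclass': 3, 'Sex': 'male', 'Age': 22.0},
    {'Pclass': 1, 'Sex': 'female', 'Age': 38.0},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)

# Numeric fields (Age, Pclass) pass through unchanged; the string field
# Sex becomes one indicator column per observed value.
print(vec.feature_names_)
print(X)
```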
3.5 Prediction
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

rfc = RandomForestClassifier()
xgbc = XGBClassifier()

# Evaluate the default-configured RandomForestClassifier and XGBClassifier
# on the training set with 5-fold cross-validation and report the mean
# classification accuracy.
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in 0.20
cross_val_score(rfc, X_train, y_train, cv=5).mean()
Output:
# 0.81152892044682812
cross_val_score(xgbc, X_train, y_train, cv=5).mean()
Output:
# 0.82158492822330198
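The mechanics of cross_val_score can be seen on a toy problem; a minimal sketch with synthetic data standing in for the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data (hypothetical stand-in).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# cross_val_score splits the data into 5 folds, trains on 4 and scores on
# the held-out fold, returning one accuracy value per fold.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores)
print(scores.mean())
```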
# Prediction
# Then use a parallel grid search to look for a better hyperparameter
# combination, hoping to further improve performance.
rfc.fit(X_train, y_train)
rfc_y_predict = rfc.predict(X_test)
rfc_submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': rfc_y_predict})
rfc_submission.to_csv('./rfc_submission.csv', index=False)
xgbc.fit(X_train, y_train)
xgbc_y_predict = xgbc.predict(X_test)
xgbc_submission = pd.DataFrame({'PassengerId':test['PassengerId'], 'Survived':xgbc_y_predict})
xgbc_submission.to_csv('./xgbc_submission.csv', index=False)

from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in 0.20
params = {'max_depth':[2,3,4,5,6], 'n_estimators':[100, 300, 500, 700, 900, 1100], 'learning_rate':[0.05, 0.1, 0.25, 0.5, 1.0]}
xgbc_best = XGBClassifier()
gs = GridSearchCV(xgbc_best, params, n_jobs = -1, cv = 5, verbose = 1)
gs.fit(X_train, y_train)
Output:
# Fitting 5 folds for each of 150 candidates, totalling 750 fits
# [Parallel(n_jobs=-1)]: Done 124 tasks | elapsed: 6.7s
# [Parallel(n_jobs=-1)]: Done 324 tasks | elapsed: 18.1s
# [Parallel(n_jobs=-1)]: Done 574 tasks | elapsed: 32.7s
# [Parallel(n_jobs=-1)]: Done 750 out of 750 | elapsed: 44.1s finished
print(gs.best_score_)
print(gs.best_params_)

xgbc_best_y_predict = gs.predict(X_test)
xgbc_best_submission = pd.DataFrame({'PassengerId':test['PassengerId'], 'Survived':xgbc_best_y_predict})
xgbc_best_submission.to_csv('./xgbc_best_submission.csv', index=False)
Output:
# 0.8316498316498316
# {'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 300}
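Calling gs.predict directly works because GridSearchCV refits by default. The mechanics can be checked on a toy problem; a minimal sketch with synthetic data and scikit-learn's LogisticRegression standing in for XGBClassifier (hypothetical setup, so it runs without xgboost installed):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data and a stand-in estimator (hypothetical setup).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

params = {'C': [0.1, 1.0, 10.0]}
gs = GridSearchCV(LogisticRegression(max_iter=1000), params, cv=5)
gs.fit(X, y)

# With refit=True (the default), the best parameter combination is
# retrained on the full training set, so gs.predict uses the best
# estimator directly.
print(gs.best_params_)
print(gs.best_score_)
```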