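[Time Series] JDD Population Dynamics Census and Forecasting, 2018 (Forward and Backward Time-Series Modeling)
This article is about time-series processing, using the JDD Population Dynamics Census and Forecasting competition as the example. It contains relatively few comments; details such as why each part is done this way and whether it might overfit are left for readers to work out. The article is code-heavy, so please study the code carefully.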
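Outline:
1 Competition overview
1.1 Preface
1.2 Background
1.3 Task
2 Packages & data loading
2.1 Importing packages
2.2 Loading the data
3 Training/validation split
3.1 Data preparation
3.1.1 Converting dates to ranks
3.2 Training/test split
3.3 Building labels
3.3.1 Building labels quickly by shifting N days
4 Feature engineering
4.1 The most recent three days
4.2 Traditional statistical features
5 Model validation, training & testing
5.1 Error function
5.2 Basic LGB model
5.3 Model training & testing
5.3.1 Forward training
5.3.2 Backward training
5.4 Format conversion for the final submission
5.5 Blending the forward and backward predictions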
1 Competition overview
Competition link:
=3dca1a91ad2a4a6da201f125ede9601a
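1.1 Preface
This notebook introduces a traditional modeling strategy for short-term time series; the strategy predicts quite well over short horizons. We use the recently concluded JDD Population Dynamics Census and Forecasting competition as the example. All of the feature engineering corresponds to the five parts of my earlier "time-series essentials" series. Most of the modeling approach resembles that of a Kaggle competition; the difference is that I use a different data representation, which speeds up steps such as feature extraction. Readers can work out the details themselves.
1.2 Background
A census is one of the most basic survey methods by which a government obtains population data and gauges national conditions and strength in each period. A census is very time-consuming and labor-intensive; since the founding of the People's Republic of China, only six national censuses have been carried out. In this era of explosive data growth and rapid progress in data technology, using AI and big data to estimate urban populations could make census work more efficient, save a great deal of time and labor, and possibly even enable real-time, dynamic population forecasting.
1.3 Task
All competition data are simulated. Participants are asked to use simulated data on the historical changes in the number of mobile-device users in several neighboring cities, user transfers between districts and counties, and the share of mobile-device users in each district or county (provided in the final stage) to build a sound forecasting model and dynamically predict the total population changes of each district and county in those cities over the next 15 days. The task assumes that one device uniquely represents one person. In the qualifying stage, total population is measured as the number of mobile-device users; in the final stage, it is measured as the number of mobile-device users divided by the share of mobile-device users.
In the regional final stage, we switched to a new dataset. The data source, time and location changed, while the data format is the same as in the qualifying stage. In the training set, however, five days in the middle were removed to simulate missing data. Participants must submit values for those five missing days and also predict the following 10 days (participants may judge for themselves whether the qualifying-stage data are still worth using).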
2 Packages & data loading
2.1 Importing packages
# Data tools
import numpy as np
np.random.seed(42)
import pandas as pd
from tqdm import tqdm

# Models
import lightgbm as lgb

# Utilities
import datetime
import warnings
warnings.filterwarnings('ignore')

# The original cell also imported text-processing and deep-learning packages
# (gensim, nltk, keras, tensorflow, xgboost, seaborn and others) that are not
# used anywhere below, so they are left out here.
%matplotlib inline
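2.2 Loading the data
flow_train = pd.read_csv('./data/flow_train.csv')

# Forward set (_s): 20170523 up to, but not including, 20170819 + 5 = 20170824
flow_train_s = flow_train.loc[(flow_train.date_dt >= 20170523) & (flow_train.date_dt < 20170819 + 5)].copy()
# Backward set (_n): 20170819 through 20170930
flow_train_n = flow_train.loc[(flow_train.date_dt >= 20170819) & (flow_train.date_dt <= 20170930)].copy()
# Judging by the `st` values used later, the five days to fill in are
# 20170819 to 20170823 and are presumably absent from the file.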
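3 Training/validation split
To guard against overfitting, it is advisable to split off the training and test sets first.
3.1 Data preparation
Extract the year, month, day and day of week, and convert each date to a rank so that labels are easy to build.
3.1.1 Converting dates to ranks for later modeling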
import datetime

class Date_Process:
    def __init__(self):
        self.rank_dic = {}

    def _dateinfo_trans(self, df):
        # Parse yyyymmdd integers and extract the calendar fields
        df['date_dt'] = df['date_dt'].apply(lambda x: datetime.datetime.strptime(str(x), '%Y%m%d'))
        df['year'] = df['date_dt'].map(lambda x: x.year)
        df['month'] = df['date_dt'].map(lambda x: x.month)
        df['day'] = df['date_dt'].map(lambda x: x.day)
        df['day_of_week'] = df['date_dt'].map(lambda x: x.weekday())
        return df

    def _ranks_fit_transform(self, df):
        # year * 400 + month * 40 + day is a sortable key; map it to a dense rank 0, 1, 2, ...
        df['ranks'] = df['year'] * 400 + df['month'] * 40 + df['day']
        rank_sort = np.sort(df['ranks'].unique())
        rank_dic = {}
        for i, val in enumerate(rank_sort):
            rank_dic[val] = i
        df['ranks'] = df['ranks'].map(rank_dic)
        self.rank_dic = rank_dic
        return df

    def _ranks_transform(self, df):
        df['ranks'] = df['year'] * 400 + df['month'] * 40 + df['day']
        df['ranks'] = df['ranks'].map(self.rank_dic)
        # map() does not raise on unseen keys, it yields NaN, so check explicitly
        if df['ranks'].isnull().any():
            print('Date not in the same range!')
        return df
date_process_s = Date_Process()
flow_train_s = date_process_s._dateinfo_trans(flow_train_s)
flow_train_s = date_process_s._ranks_fit_transform(flow_train_s)

date_process_n = Date_Process()
flow_train_n = date_process_n._dateinfo_trans(flow_train_n)
flow_train_n = date_process_n._ranks_fit_transform(flow_train_n)
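To make the date-to-rank step concrete, here is a minimal sketch on a hypothetical toy series (the dates are invented for illustration). It uses the same sortable key as `Date_Process` and shows that the rank counts observed dates only, so a missing calendar day does not leave a hole in the ranks.

```python
import pandas as pd

# Hypothetical toy dates (yyyymmdd integers); 20170601 is deliberately missing.
dates = pd.Series([20170602, 20170530, 20170531, 20170530])

parsed = pd.to_datetime(dates.astype(str), format="%Y%m%d")
# Same sortable key as in Date_Process: increasing in calendar order.
key = parsed.dt.year * 400 + parsed.dt.month * 40 + parsed.dt.day

# Map each distinct key to its position in sorted order (0, 1, 2, ...).
rank_dic = {k: i for i, k in enumerate(sorted(key.unique()))}
ranks = key.map(rank_dic)
print(ranks.tolist())  # [2, 0, 1, 0]
```

Because 20170602 gets rank 2 rather than 3, consecutive ranks are not guaranteed to be consecutive calendar days; this only matters if the data themselves contain gaps inside the range being ranked.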
3.2 Training/test split
Note that the commented-out part below is used for validation.
flow_train_s.tail()

flow_train_data_n = flow_train_n.copy()
flow_train_data_s = flow_train_s.copy()
3.3 Building labels
3.3.1 Building labels quickly by shifting N days
def _get_traditional_label_n(df, grp_col='district_code', label_cols=['dwell', 'flow_in', 'flow_out'], day_shifts=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]):
    for day in tqdm(day_shifts):
        for label_col in label_cols:
            # shift(day): the label of a row is the value `day` days earlier (backward direction).
            # This assumes the rows of each district are in ascending date order.
            df[label_col + '_' + str(day)] = df.groupby(grp_col)[label_col].shift(day).values
    return df

def _get_traditional_label_s(df, grp_col='district_code', label_cols=['dwell', 'flow_in', 'flow_out'], day_shifts=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]):
    for day in tqdm(day_shifts):
        for label_col in label_cols:
            # shift(-day): the label of a row is the value `day` days later (forward direction)
            df[label_col + '_' + str(day)] = df.groupby(grp_col)[label_col].shift(-1 * day).values
    return df

df_train_n = _get_traditional_label_n(flow_train_data_n)
df_train_s = _get_traditional_label_s(flow_train_data_s)
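The effect of the two shift directions is easiest to see on a small example. The sketch below uses a hypothetical two-district frame (all values invented) and a horizon of two days; the column names `dwell_fwd_2` and `dwell_bwd_2` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical toy frame: two districts, four consecutive days each.
df = pd.DataFrame({
    "district_code": ["a"] * 4 + ["b"] * 4,
    "ranks": [0, 1, 2, 3] * 2,
    "dwell": [10, 11, 12, 13, 20, 21, 22, 23],
})

grouped = df.groupby("district_code")["dwell"]
# shift(-2): value two days later, i.e. the label when forecasting forward.
df["dwell_fwd_2"] = grouped.shift(-2)
# shift(2): value two days earlier, i.e. the label when forecasting backward.
df["dwell_bwd_2"] = grouped.shift(2)
print(df)
```

Rows whose label is NaN have no observed target: the last two days of each district in the forward case and the first two in the backward case. Those are the rows the notebook later uses as prediction rows, while rows with a label become training rows. Shifting within `groupby` keeps values from leaking across districts.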
4 Feature engineering
4.1 The most recent three days
from tqdm import tqdm

# Equivalent to computing features over the three nearest days.
# In this (backward) direction the "last" days are the days that follow in time.
def _get_last3_days_feature_n(df):
    grp_col = 'district_code'

    # Precompute the shifted columns once to save time
    for fea_col in tqdm(['flow_in', 'flow_out', 'dwell']):
        for i in range(1, 1 + 28):
            df[fea_col + '_after_{}'.format(i)] = df.groupby(grp_col)[fea_col].shift(-1 * i).values

    recent_features = pd.DataFrame()
    recent_features[grp_col] = df[grp_col].values
    recent_features['ranks'] = df['ranks'].values

    # Features from the three nearest days
    for fea_col in tqdm(['flow_in', 'flow_out', 'dwell']):
        # Raw values of the neighboring days
        recent_features[fea_col + '_last_1'] = df[fea_col + '_after_1'].values
        recent_features[fea_col + '_last_2'] = df[fea_col + '_after_2'].values
        recent_features[fea_col + '_last_3'] = df[fea_col + '_after_3'].values

        recent_features[fea_col + '_mean_3'] = (df[fea_col].values + df[fea_col + '_after_1'].values + df[fea_col + '_after_2'].values) / 3.0

        # Change features: first differences, their difference, and a ratio
        recent_features[fea_col + '_diff_1'] = df[fea_col].values - df[fea_col + '_after_1'].values
        recent_features[fea_col + '_diff_2'] = df[fea_col + '_after_1'].values - df[fea_col + '_after_2'].values
        recent_features[fea_col + '_diff_diff'] = recent_features[fea_col + '_diff_1'].values - recent_features[fea_col + '_diff_2'].values
        recent_features[fea_col + '_divide_1'] = df[fea_col].values / (df[fea_col + '_after_1'].values + 1e-5)
        # Identical to _divide_1; kept as in the original
        recent_features[fea_col + '_divide'] = df[fea_col].values / (df[fea_col + '_after_1'].values + 1e-5)
    return df, recent_features
def _get_last3_days_feature_s(df):
    grp_col = 'district_code'

    # Precompute the shifted columns once to save time
    for fea_col in tqdm(['flow_in', 'flow_out', 'dwell']):
        for i in range(1, 1 + 28):
            df[fea_col + '_before_{}'.format(i)] = df.groupby(grp_col)[fea_col].shift(i).values

    recent_features = pd.DataFrame()
    recent_features[grp_col] = df[grp_col].values
    recent_features['ranks'] = df['ranks'].values

    # Features from the past three days
    for fea_col in tqdm(['flow_in', 'flow_out', 'dwell']):
        # Raw values of the previous days
        recent_features[fea_col + '_last_1'] = df[fea_col + '_before_1'].values
        recent_features[fea_col + '_last_2'] = df[fea_col + '_before_2'].values
        recent_features[fea_col + '_last_3'] = df[fea_col + '_before_3'].values

        recent_features[fea_col + '_mean_3'] = (df[fea_col].values + df[fea_col + '_before_1'].values + df[fea_col + '_before_2'].values) / 3.0

        # Change features: first differences, their difference, and a ratio
        recent_features[fea_col + '_diff_1'] = df[fea_col].values - df[fea_col + '_before_1'].values
        recent_features[fea_col + '_diff_2'] = df[fea_col + '_before_1'].values - df[fea_col + '_before_2'].values
        recent_features[fea_col + '_diff_diff'] = recent_features[fea_col + '_diff_1'].values - recent_features[fea_col + '_diff_2'].values
        recent_features[fea_col + '_divide_1'] = df[fea_col].values / (df[fea_col + '_before_1'].values + 1e-5)
        # Identical to _divide_1; kept as in the original
        recent_features[fea_col + '_divide'] = df[fea_col].values / (df[fea_col + '_before_1'].values + 1e-5)
    return df, recent_features
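df_train_expand_n, recent_features_n = _get_last3_days_feature_n(df_train_n)
# df_train_expand_s and recent_features_s are used further down, so build them as well
df_train_expand_s, recent_features_s = _get_last3_days_feature_s(df_train_s)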
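As a quick check of what the recent-day features measure, the sketch below recomputes the mean, the two first differences and the second difference for one hypothetical district (values invented). It follows the forward-direction (`_before_`) variant.

```python
import pandas as pd

# Hypothetical single-district series with accelerating growth.
df = pd.DataFrame({
    "district_code": ["a"] * 4,
    "ranks": [0, 1, 2, 3],
    "dwell": [10.0, 12.0, 15.0, 19.0],
})
g = df.groupby("district_code")["dwell"]
before_1, before_2 = g.shift(1), g.shift(2)

feats = pd.DataFrame({
    "ranks": df["ranks"],
    "dwell_mean_3": (df["dwell"] + before_1 + before_2) / 3.0,
    "dwell_diff_1": df["dwell"] - before_1,   # change from yesterday to today
    "dwell_diff_2": before_1 - before_2,      # change from two days ago to yesterday
})
# Second difference: positive when growth is speeding up.
feats["dwell_diff_diff"] = feats["dwell_diff_1"] - feats["dwell_diff_2"]
print(feats)
```

On the last row the first differences are 4 and 3, so the second difference is 1. The first rows contain NaN because there is not enough history; tree models such as LightGBM can consume those missing values directly.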
4.2 Traditional statistical features
from scipy.stats import skew
from tsfresh.feature_extraction import feature_calculators as ts

def get_quantile(x, percentiles=[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]):
    x_len = len(x)
    x = np.sort(x)
    sts_feas = []
    for per_ in percentiles:
        if per_ == 1:
            sts_feas.append(x[x_len - 1])
        else:
            sts_feas.append(x[int(x_len * per_)])
    return sts_feas

def _get_sts_features_s(data):
    grp_col = 'district_code'
    city_code = 'city_code'
    df = pd.DataFrame()
    df[grp_col] = data[grp_col].values
    df[city_code] = data[city_code].values
    df['ranks'] = data['ranks'].values

    # Statistics over the past N days
    for fea_col in tqdm(['flow_in', 'flow_out', 'dwell']):
        for slide_windows in [3, 6, 13, 20, 27]:
            print(fea_col, slide_windows)

            slide_cols = [fea_col + '_before_' + str(i + 1) for i in range(slide_windows)]
            slide_cols.append(fea_col)

            df_tmp = data[slide_cols].values
            df_tmp_percent = data[slide_cols].copy()
            df_tmp_percent['sum_'] = df_tmp_percent.sum(axis=1).values
            for col in slide_cols:
                # Share of each day in the window total
                df_tmp_percent[col] = df_tmp_percent[col].values / (1e-5 + df_tmp_percent['sum_'].values)
            df_tmp_percent = df_tmp_percent[slide_cols].values
            # The original also computed per-city sums here (df_grp, df_city_dic)
            # that were never used; they are omitted.

            prefix = 'district_' + fea_col + '_last{}'.format(slide_windows)
            df[prefix + '_sum'] = np.sum(df_tmp, axis=1)
            df[prefix + '_median'] = np.median(df_tmp, axis=1)
            df[prefix + '_std'] = np.std(df_tmp, axis=1)
            df[prefix + '_min'] = np.min(df_tmp, axis=1)
            df[prefix + '_max'] = np.max(df_tmp, axis=1)
            # .values avoids index alignment, since df has a fresh RangeIndex
            df[prefix + '_mean_change'] = data[slide_cols].apply(ts.mean_change, axis=1).values

            pct_prefix = 'district_percent_' + fea_col + '_last{}'.format(slide_windows)
            df[pct_prefix + '_median'] = np.median(df_tmp_percent, axis=1)
            df[pct_prefix + '_std'] = np.std(df_tmp_percent, axis=1)
            df[pct_prefix + '_min'] = np.min(df_tmp_percent, axis=1)
            df[pct_prefix + '_max'] = np.max(df_tmp_percent, axis=1)
            # NOTE: named "percent" but computed from the raw values (df_tmp), as in the
            # original; _get_sts_features_n uses df_tmp_percent for this feature.
            df[pct_prefix + '_skew'] = skew(df_tmp, axis=1)

    # Same weekday one week earlier
    df['flow_in_last_week'] = data['flow_in_before_7'].values
    df['flow_out_last_week'] = data['flow_out_before_7'].values
    df['dwell_last_week'] = data['dwell_before_7'].values

    # Statistics over the same weekday of the previous 1 to 3 weeks
    for fea_col in tqdm(['flow_in', 'flow_out', 'dwell']):
        for slide_windows in range(1, 4):
            print(fea_col, slide_windows)

            slide_cols = [fea_col + '_before_' + str(i * 7 + 7) for i in range(slide_windows)]
            slide_cols.append(fea_col)
            df_tmp = data[slide_cols].values

            if slide_windows == 1:
                df[fea_col + '_w{}_mean'.format(slide_windows)] = np.mean(df_tmp, axis=1)
            else:
                df[fea_col + '_w{}_mean'.format(slide_windows)] = np.mean(df_tmp, axis=1)
                df[fea_col + '_w{}_median'.format(slide_windows)] = np.median(df_tmp, axis=1)
                df[fea_col + '_w{}_std'.format(slide_windows)] = np.std(df_tmp, axis=1)
                df[fea_col + '_w{}_mean_change'.format(slide_windows)] = data[slide_cols].apply(ts.mean_change, axis=1).values
    return df
def _get_sts_features_n(data):
    grp_col = 'district_code'
    city_code = 'city_code'
    df = pd.DataFrame()
    df[grp_col] = data[grp_col].values
    df[city_code] = data[city_code].values
    df['ranks'] = data['ranks'].values

    # Statistics over the N neighboring days (the following days in this direction)
    for fea_col in tqdm(['flow_in', 'flow_out', 'dwell']):
        for slide_windows in [3, 6, 13, 20, 27]:
            print(fea_col, slide_windows)

            slide_cols = [fea_col + '_after_' + str(i + 1) for i in range(slide_windows)]
            slide_cols.append(fea_col)

            df_tmp = data[slide_cols].values
            df_tmp_percent = data[slide_cols].copy()
            df_tmp_percent['sum_'] = df_tmp_percent.sum(axis=1).values
            for col in slide_cols:
                # Share of each day in the window total
                df_tmp_percent[col] = df_tmp_percent[col].values / (1e-5 + df_tmp_percent['sum_'].values)
            df_tmp_percent = df_tmp_percent[slide_cols].values
            # The original also computed per-city sums here (df_grp, df_city_dic)
            # that were never used; they are omitted.

            prefix = 'district_' + fea_col + '_last{}'.format(slide_windows)
            df[prefix + '_sum'] = np.sum(df_tmp, axis=1)
            df[prefix + '_median'] = np.median(df_tmp, axis=1)
            df[prefix + '_std'] = np.std(df_tmp, axis=1)
            df[prefix + '_min'] = np.min(df_tmp, axis=1)
            df[prefix + '_max'] = np.max(df_tmp, axis=1)
            # .values avoids index alignment, since df has a fresh RangeIndex
            df[prefix + '_mean_change'] = data[slide_cols].apply(ts.mean_change, axis=1).values
            df[prefix + '_skew'] = skew(df_tmp, axis=1)

            pct_prefix = 'district_percent_' + fea_col + '_last{}'.format(slide_windows)
            df[pct_prefix + '_median'] = np.median(df_tmp_percent, axis=1)
            df[pct_prefix + '_std'] = np.std(df_tmp_percent, axis=1)
            df[pct_prefix + '_min'] = np.min(df_tmp_percent, axis=1)
            df[pct_prefix + '_max'] = np.max(df_tmp_percent, axis=1)
            df[pct_prefix + '_skew'] = skew(df_tmp_percent, axis=1)

    # Same weekday one week away
    df['flow_in_last_week'] = data['flow_in_after_7'].values
    df['flow_out_last_week'] = data['flow_out_after_7'].values
    df['dwell_last_week'] = data['dwell_after_7'].values

    # Statistics over the same weekday of the neighboring 1 to 3 weeks
    for fea_col in tqdm(['flow_in', 'flow_out', 'dwell']):
        for slide_windows in range(1, 4):
            print(fea_col, slide_windows)

            slide_cols = [fea_col + '_after_' + str(i * 7 + 7) for i in range(slide_windows)]
            slide_cols.append(fea_col)
            df_tmp = data[slide_cols].values

            if slide_windows == 1:
                df[fea_col + '_w{}_mean'.format(slide_windows)] = np.mean(df_tmp, axis=1)
            else:
                df[fea_col + '_w{}_mean'.format(slide_windows)] = np.mean(df_tmp, axis=1)
                df[fea_col + '_w{}_median'.format(slide_windows)] = np.median(df_tmp, axis=1)
                df[fea_col + '_w{}_std'.format(slide_windows)] = np.std(df_tmp, axis=1)
                df[fea_col + '_w{}_mean_change'.format(slide_windows)] = data[slide_cols].apply(ts.mean_change, axis=1).values
    return df
%%time
sts_features_n = _get_sts_features_n(df_train_expand_n)

%%time
sts_features_s = _get_sts_features_s(df_train_expand_s)
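The window statistics above all operate row-wise on a matrix whose columns are the shifted copies of one series. The following sketch reproduces the main ones with plain NumPy on a hypothetical two-row window (values invented); `mean_change` is written out as the mean of consecutive differences, which is an assumption about what tsfresh's `mean_change` computes rather than a call to it.

```python
import numpy as np

# Hypothetical window matrix: one row per (district, day); the columns hold
# four values of the same series in chronological order.
window = np.array([
    [4.0, 6.0, 8.0, 10.0],
    [5.0, 5.0, 5.0, 5.0],
])

stats = {
    "sum": window.sum(axis=1),
    "median": np.median(window, axis=1),
    "std": window.std(axis=1),
    "min": window.min(axis=1),
    "max": window.max(axis=1),
    # Mean of consecutive differences, equal to (last - first) / (n - 1).
    "mean_change": np.diff(window, axis=1).mean(axis=1),
}

# Share of each day within its window, as in the "percent" features.
share = window / (window.sum(axis=1, keepdims=True) + 1e-5)
print(stats["sum"], stats["mean_change"])
```

Note that a statistic such as `mean_change` depends on column order. The notebook passes the columns as `_before_1`, `_before_2`, ..., followed by the current day, which is not chronological, so its values differ from a chronological reading of the window.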
5 Model validation, training & testing
5.1 Error function
# Error function: root mean squared error on the log1p scale (RMSLE)
def get_error(y_pred, y_true):
    return np.sqrt(np.mean(np.power(np.log1p(y_pred) - np.log1p(y_true), 2)))
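A small worked example may help show what this metric rewards. In the sketch below (toy numbers, with the function restated under the hypothetical name `rmsle` so the block runs on its own), both predictions are off by a factor of two in `1 + y`, so each contributes the same error regardless of scale.

```python
import numpy as np

def rmsle(y_pred, y_true):
    """Root mean squared error between log1p-transformed values."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

y_true = np.array([99.0, 999.0])
# (1 + prediction) is exactly twice (1 + truth) for both entries.
y_pred = np.array([199.0, 1999.0])

error = rmsle(y_pred, y_true)
print(error)  # log(2), about 0.6931
```

This is also why the models below are trained on `np.log1p` of the targets with an RMSE objective: minimizing RMSE on the transformed targets is the same as minimizing this error on the original scale, and `np.expm1` converts the predictions back.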
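5.2 Basic LGB model
def _get_lgb_models_test(train, train_label, n_estimators=350):
    lgb_params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': 'rmse',
        'learning_rate': 0.05,
        'subsample': 0.9,
        'colsample_bytree': 0.9,
        'n_estimators': n_estimators,
        'verbose': -1,
        'max_depth': 3
    }

    model = lgb.LGBMRegressor(**lgb_params)
    # Recent LightGBM releases take logging and early stopping as callbacks
    # rather than the `verbose` / `early_stopping_rounds` arguments of fit().
    # The evaluation set is the training set itself, so early stopping is not
    # expected to trigger and all n_estimators trees are normally built.
    model.fit(
        train, train_label,
        eval_set=[(train, train_label)],
        eval_metric='rmse',
        callbacks=[lgb.log_evaluation(period=50), lgb.early_stopping(stopping_rounds=250)])
    return model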
5.3 Model training & testing
5.3.1 Forward training
grp_col = 'district_code'
Sub_N = None
st = 20170823
for test_day in range(1, 6):
    test_day = str(test_day)
    label_cols = ['dwell_' + test_day, 'flow_in_' + test_day, 'flow_out_' + test_day]
    base_cols = label_cols + ['month', 'day', 'day_of_week', 'dwell', 'flow_in', 'flow_out', grp_col, 'ranks']

    # Backward direction: rows with a label are training rows, rows without one are prediction rows
    train_label_n = df_train_n.loc[df_train_n['dwell_' + test_day].notnull()][base_cols]
    val_label_n = df_train_n.loc[df_train_n['dwell_' + test_day].isnull()][base_cols]

    # Train only on the eight ranks closest to the gap
    train_label_n = train_label_n.loc[train_label_n.ranks < train_label_n['ranks'].min() + 8]
    train_label_n = train_label_n.merge(sts_features_n, on=['district_code', 'ranks'], how='left')
    train_label_n = train_label_n.merge(recent_features_n, on=['district_code', 'ranks'], how='left')

    val_label_n = val_label_n.merge(sts_features_n, on=['district_code', 'ranks'], how='left')
    val_label_n = val_label_n.merge(recent_features_n, on=['district_code', 'ranks'], how='left')

    # Drop the labels, the rank, object-typed columns, and month/year
    train_cols = [col for col in train_label_n.columns
                  if 'dwell_' + test_day not in col
                  and 'flow_in_' + test_day not in col
                  and 'flow_out_' + test_day not in col
                  and 'ranks' not in col
                  and train_label_n[col].dtype != 'O'
                  and 'month' not in col
                  and 'year' not in col]

    print('model training')
    # One model per target, trained on log1p of the label
    model_dwell = _get_lgb_models_test(train_label_n[train_cols], train_label_n['dwell_' + test_day].apply(np.log1p).values, n_estimators=350)
    model_flowin = _get_lgb_models_test(train_label_n[train_cols], train_label_n['flow_in_' + test_day].apply(np.log1p).values, n_estimators=350)
    model_flowout = _get_lgb_models_test(train_label_n[train_cols], train_label_n['flow_out_' + test_day].apply(np.log1p).values, n_estimators=350)

    val_label_n['dwell_' + test_day + '_predict'] = model_dwell.predict(val_label_n[train_cols])
    val_label_n['flow_in_' + test_day + '_predict'] = model_flowin.predict(val_label_n[train_cols])
    val_label_n['flow_out_' + test_day + '_predict'] = model_flowout.predict(val_label_n[train_cols])

    # NOTE: in this direction the row closest to the gap is the one with the smallest rank;
    # .last() picks it only if the rows of each district are in descending date order.
    dwell_pred_dict = val_label_n.groupby(grp_col)['dwell_' + test_day + '_predict'].last().to_dict()
    flow_in_pred_dict = val_label_n.groupby(grp_col)['flow_in_' + test_day + '_predict'].last().to_dict()
    flow_out_pred_dict = val_label_n.groupby(grp_col)['flow_out_' + test_day + '_predict'].last().to_dict()

    # Horizon k fills the date st + 1 - k, counting back from the end of the gap
    val_label_n['date_dt'] = st + 1 - int(test_day)
    submit_n = val_label_n[['district_code', 'date_dt']].copy()
    submit_n = submit_n.drop_duplicates(subset=['district_code', 'date_dt'])

    # Predictions are on the log1p scale; expm1 converts them back
    submit_n['dwell'] = np.expm1(submit_n['district_code'].map(dwell_pred_dict))
    submit_n['flow_in'] = np.expm1(submit_n['district_code'].map(flow_in_pred_dict))
    submit_n['flow_out'] = np.expm1(submit_n['district_code'].map(flow_out_pred_dict))
    if test_day == '1':
        Sub_N = submit_n
    else:
        Sub_N = pd.concat([Sub_N, submit_n], ignore_index=True)
print(Sub_N.shape)
5.3.2 Backward training
grp_col = 'district_code'
Sub_S = None
st = 20170819
for test_day in range(1, 6):
    test_day = str(test_day)
    label_cols = ['dwell_' + test_day, 'flow_in_' + test_day, 'flow_out_' + test_day]
    base_cols = label_cols + ['month', 'day', 'day_of_week', 'dwell', 'flow_in', 'flow_out', grp_col, 'ranks']

    # Rows with a label are training rows, rows without one are prediction rows
    train_label_s = df_train_s.loc[df_train_s['dwell_' + test_day].notnull()][base_cols]
    val_label_s = df_train_s.loc[df_train_s['dwell_' + test_day].isnull()][base_cols]

    # Train only on the ranks closest to the gap
    train_label_s = train_label_s.loc[train_label_s.ranks >= train_label_s['ranks'].max() - 7]
    train_label_s = train_label_s.merge(sts_features_s, on=['district_code', 'ranks'], how='left')
    train_label_s = train_label_s.merge(recent_features_s, on=['district_code', 'ranks'], how='left')

    val_label_s = val_label_s.merge(sts_features_s, on=['district_code', 'ranks'], how='left')
    val_label_s = val_label_s.merge(recent_features_s, on=['district_code', 'ranks'], how='left')

    # Drop the labels, the rank, object-typed columns, and year
    train_cols = [col for col in train_label_s.columns
                  if 'dwell_' + test_day not in col
                  and 'flow_in_' + test_day not in col
                  and 'flow_out_' + test_day not in col
                  and 'ranks' not in col
                  and train_label_s[col].dtype != 'O'
                  and 'year' not in col]

    print('model training')
    # One model per target, trained on log1p of the label
    model_dwell = _get_lgb_models_test(train_label_s[train_cols], train_label_s['dwell_' + test_day].apply(np.log1p).values, n_estimators=350)
    model_flowin = _get_lgb_models_test(train_label_s[train_cols], train_label_s['flow_in_' + test_day].apply(np.log1p).values, n_estimators=350)
    model_flowout = _get_lgb_models_test(train_label_s[train_cols], train_label_s['flow_out_' + test_day].apply(np.log1p).values, n_estimators=350)

    val_label_s['dwell_' + test_day + '_predict'] = model_dwell.predict(val_label_s[train_cols])
    val_label_s['flow_in_' + test_day + '_predict'] = model_flowin.predict(val_label_s[train_cols])
    val_label_s['flow_out_' + test_day + '_predict'] = model_flowout.predict(val_label_s[train_cols])

    # Keep the prediction made from the last available row of each district
    dwell_pred_dict = val_label_s.groupby(grp_col)['dwell_' + test_day + '_predict'].last().to_dict()
    flow_in_pred_dict = val_label_s.groupby(grp_col)['flow_in_' + test_day + '_predict'].last().to_dict()
    flow_out_pred_dict = val_label_s.groupby(grp_col)['flow_out_' + test_day + '_predict'].last().to_dict()

    # Horizon k fills the date st - 1 + k, counting forward from the start of the gap
    val_label_s['date_dt'] = st - 1 + int(test_day)
    submit_s = val_label_s[['district_code', 'date_dt']].copy()
    submit_s = submit_s.drop_duplicates(subset=['district_code', 'date_dt'])

    # Predictions are on the log1p scale; expm1 converts them back
    submit_s['dwell'] = np.expm1(submit_s['district_code'].map(dwell_pred_dict))
    submit_s['flow_in'] = np.expm1(submit_s['district_code'].map(flow_in_pred_dict))
    submit_s['flow_out'] = np.expm1(submit_s['district_code'].map(flow_out_pred_dict))
    if test_day == '1':
        Sub_S = submit_s
    else:
        Sub_S = pd.concat([Sub_S, submit_s], ignore_index=True)
print(Sub_S.shape)
5.4 Format conversion for the final submission
from copy import deepcopy
submit_8month = deepcopy(Sub_S)

# Sort both frames the same way so that their rows line up by position
submit_8month = submit_8month.sort_values(['district_code', 'date_dt'])
Sub_N = Sub_N.sort_values(['district_code', 'date_dt'])

district_2_city_dic = flow_train.groupby(['district_code'])['city_code'].last().to_dict()
submit_8month['city_code'] = submit_8month['district_code'].map(district_2_city_dic)
5.5 Blending the forward and backward predictions
# weights[i] is the weight of Sub_N on day st + i; Sub_S gets the remainder
weights = [0, 0.1, 0.5, 0.9, 1]
st = 20170819
for i in range(5):
    s_weight = 1 - weights[i]
    n_weight = weights[i]
    print(st + i)
    day_s = submit_8month['date_dt'] == st + i
    day_n = Sub_N['date_dt'] == st + i
    for col in ['dwell', 'flow_in', 'flow_out']:
        submit_8month.loc[day_s, col] = submit_8month.loc[day_s, col].values * s_weight + Sub_N.loc[day_n, col].values * n_weight

submit_8month.to_csv('sub8.csv', index=None)
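The blending weights can be read as a hand-picked interpolation across the five missing days: the first day relies entirely on the model trained on data before the gap, the last day entirely on the model trained on data after it. A minimal sketch with made-up predictions for a single district (the arrays `pred_s` and `pred_n` are hypothetical):

```python
import numpy as np

# Weight of the after-gap model on each of the five missing days.
weights = np.array([0, 0.1, 0.5, 0.9, 1])

# Hypothetical predictions for one district over the five days.
pred_s = np.full(5, 100.0)   # model trained on data before the gap
pred_n = np.full(5, 200.0)   # model trained on data after the gap

blended = (1 - weights) * pred_s + weights * pred_n
print(blended)  # approximately [100, 110, 150, 190, 200]
```

The intuition is that a forecast tends to be more reliable the closer its target day is to the data it was trained on. The specific weights are the author's choice; tuning them against a held-out gap is one of the things readers could experiment with.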
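This article contains many score-boosting techniques. I hope you will try them out yourself and find out which module improves the score the most.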