[Time Series] JDD Population Dynamics Census and Forecasting, 2018 (Forward and Backward Time-Series Modeling)

Updated: 2024-10-24 12:23:44

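  • This article is about time-series processing, using the JDD Population Dynamics Census and Forecasting competition as the example. It contains relatively few comments; details such as why each step is done and whether it might overfit are left for readers to work out. The article is code-heavy, so it is best read with the code at hand.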

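Outline:

  • 1  Competition Overview

      • 1.1  Preface

      • 1.2  Background

      • 1.3  Task

  • 2  Packages & Data Import

      • 2.1  Package imports

      • 2.2  Data import

  • 3  Train/Validation Split

      • 3.1  Data preparation

        • 3.1.1  Converting dates to ranks

      • 3.2  Train/test split

      • 3.3  Building labels

        • 3.3.1  Building labels quickly by shifting N days

  • 4  Feature Engineering

      • 4.1  The most recent three days

      • 4.2  Traditional statistical features

  • 5  Model Validation, Training & Testing

      • 5.1  Error function

      • 5.2  Baseline LGB model

      • 5.3  Model training & testing

        • 5.3.1  Backward training

        • 5.3.2  Forward training

      • 5.4  Format conversion for the final submission

      • 5.5  Forward/backward fusion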

1  Competition Overview

  • Competition link:

    =3dca1a91ad2a4a6da201f125ede9601a

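1.1  Preface

This notebook introduces a traditional modeling strategy for short-horizon time series, which predicts well over short horizons. The example is the recently concluded JDD Population Dynamics Census and Forecasting competition. All of the feature engineering maps onto the five parts of my earlier "time-series essentials" series. Most of the modeling approach is similar to that of a Kaggle competition; the difference is that I use a different data representation, which speeds up feature extraction and related steps compared with that Kaggle solution. The details are left for readers to work out.

1.2  Background

A census is one of the most basic survey methods by which a government obtains population data and gauges national conditions and strength in each period. Censuses are very time-consuming and labor-intensive: since the founding of the People's Republic of China, only six nationwide censuses have been carried out. In an era of explosive data growth and rapid progress in data technology, using artificial intelligence and big data to estimate urban population could make census work more efficient, save a great deal of time and labor, and possibly even enable real-time, dynamic population forecasting.

1.3  Task

All competition data are simulated. Participants are asked to use simulated data for several neighboring cities, namely the historical changes in the number of mobile-device users, the user transfers between districts and counties, and the share of mobile-device users within each district or county (provided in the finals stage), to build a sound forecasting model and dynamically predict the change in total population of each district and county in those cities over the next 15 days. The task assumes that one device uniquely represents one person. In the qualifying stage, total population is measured as the number of mobile-device users; in the finals stage, it is measured as the number of mobile-device users divided by the share of mobile-device users.

For the regional finals, a new dataset was issued: the data source, time, and locations all changed, while the data format matches the qualifying stage. However, five days in the middle of the training data were removed to simulate missing data. Participants must submit values for those five missing days and also predict the following 10 days (whether to keep using the qualifying-stage data is left to each participant's judgment).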
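
The two population conventions described above can be written down as a small helper. This is only a sketch of the stated formula; the function name `estimate_population` and the numbers are hypothetical and do not come from the competition data.

```python
def estimate_population(device_users: float, user_share: float) -> float:
    """Finals-stage convention: population = device users / device-user share.

    With user_share = 1.0 this reduces to the qualifying-stage convention,
    where population is simply the number of device users.
    """
    if not 0 < user_share <= 1:
        raise ValueError("user_share must be in (0, 1]")
    return device_users / user_share


# Hypothetical numbers, for illustration only.
print(estimate_population(90_000, 0.75))  # 120000.0
print(estimate_population(90_000, 1.0))   # 90000.0
```

The code in the rest of this article works on the user counts directly (`dwell`, `flow_in`, `flow_out`), so a conversion like this would only matter when the user share is available.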

2  Packages & Data Import

2.1  Package imports

## Data toolkits

import numpy as np

np.random.seed(42)

import pandas as pd

from tqdm import tqdm

## String-processing toolkits

import string

import re

import gensim

from collections import Counter

import pickle

from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.decomposition import TruncatedSVD 

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_auc_score

from sklearn.model_selection import KFold

from keras.preprocessing import text, sequence

import tensorflow as tf

# Note: the two TensorFlow calls below exist only in TensorFlow 1.x.
tf.enable_eager_execution()

tfe = tf.contrib.eager

import warnings

warnings.filterwarnings('ignore')

import xgboost as xgb

import lightgbm as lgb

from functools import partial

import os 

import gc

from scipy.sparse import vstack  

import time

import multiprocessing as mp

import seaborn as sns

%matplotlib inline

2.2  Data import

flow_train = pd.read_csv('./data/flow_train.csv') 

# .copy() so that columns can be added later without SettingWithCopyWarning
flow_train_s = flow_train.loc[((flow_train.date_dt >= 20170523) & (flow_train.date_dt < 20170819 + 5))].copy()

flow_train_n = flow_train.loc[((flow_train.date_dt >= 20170819) & (flow_train.date_dt <= 20170930))].copy()

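3  Train/Validation Split

  • To guard against overfitting, it is advisable to split the data into training and test sets first.

3.1  Data preparation

  • Extract the year, month, day, and day of week, and convert each date to a rank so that labels are easy to build.

3.1.1  Converting dates to ranks for later modeling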

import datetime 

class Date_Process:

    def __init__(self):

        self.rank_dic = {}

        

    def _dateinfo_trans(self,df):

        df['date_dt']     =   df['date_dt'].apply(lambda x: datetime.datetime.strptime(str(x), '%Y%m%d')) 

        df['year']        =   df['date_dt'].map(lambda x:x.year)

        df['month']       =   df['date_dt'].map(lambda x:x.month)

        df['day']         =   df['date_dt'].map(lambda x:x.day)

        df['day_of_week'] =   df['date_dt'].map(lambda x:x.weekday())

        return df

    

    def _ranks_fit_transform(self,df):

        df['ranks'] = df['year'] * 400 +  df['month'] * 40 + df['day']

        rank_sort = np.sort(df['ranks'].unique())

        rank_dic = {}

        for i,val in enumerate(rank_sort):

            rank_dic[val] = i

        df['ranks'] = df['ranks'].map(rank_dic)

        self.rank_dic = rank_dic

        return df

    

    def _ranks_transform(self,df):

        df['ranks'] = df['year'] * 400 +  df['month'] * 40 + df['day']

        try:

            df['ranks'] = df['ranks'].map(self.rank_dic) 

        except:

            print('Date not in the same range!')

        return df

date_process_s = Date_Process()

flow_train_s  =  date_process_s._dateinfo_trans(flow_train_s)

flow_train_s  =  date_process_s._ranks_fit_transform(flow_train_s)

date_process_n = Date_Process()

flow_train_n  =  date_process_n._dateinfo_trans(flow_train_n)

flow_train_n  =  date_process_n._ranks_fit_transform(flow_train_n) 
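
To make the rank conversion concrete, here is a minimal sketch on a toy frame. The dates are hypothetical and `day_offset` is an extra column added only for comparison. It shows that the `year * 400 + month * 40 + day` key sorts in calendar order and that the resulting ranks are dense, so missing calendar days do not leave a hole in the rank sequence.

```python
import pandas as pd

# Hypothetical dates with a 5-day hole (2017-08-19 .. 2017-08-23 are absent).
dates = pd.to_datetime(["2017-08-17", "2017-08-18", "2017-08-24", "2017-08-25"])
toy = pd.DataFrame({"date_dt": dates})

# Same sortable key as in _ranks_fit_transform.
key = (toy["date_dt"].dt.year * 400
       + toy["date_dt"].dt.month * 40
       + toy["date_dt"].dt.day)

# Dense rank: position of each key among the sorted unique keys.
rank_dic = {val: i for i, val in enumerate(sorted(key.unique()))}
toy["ranks"] = key.map(rank_dic)

# Calendar-day offset from the first date, for comparison only.
toy["day_offset"] = (toy["date_dt"] - toy["date_dt"].min()).dt.days

print(toy)
```

Because the ranks ignore calendar gaps, a shift by k rows equals a shift by k days only inside a segment with no missing dates. That is presumably why the two segments above are each ranked with their own `Date_Process` instance.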

3.2  Train/test split

  • Note that the commented-out part below is intended for validation.


flow_train_s.tail()

flow_train_data_n  =  flow_train_n.copy()

flow_train_data_s = flow_train_s.copy()

3.3  Building labels

3.3.1  Building labels quickly by shifting N days

def _get_traditional_label_n(df, grp_col = 'district_code', label_cols = ['dwell','flow_in','flow_out'], day_shifts = [1,2,3,4,5,6,7,8,9,10]):

    for day in tqdm(day_shifts):

        for label_col in label_cols:

            df[label_col + '_' + str(day)] = df.groupby(grp_col)[label_col].shift(day).values  # backward: the label is the value `day` days earlier

    return df 

def _get_traditional_label_s(df, grp_col = 'district_code', label_cols = ['dwell','flow_in','flow_out'], day_shifts = [1,2,3,4,5,6,7,8,9,10]):

    for day in tqdm(day_shifts):

        for label_col in label_cols:

            df[label_col + '_' + str(day)] = df.groupby(grp_col)[label_col].shift(-1 * day).values  # forward: the label is the value `day` days later

    return df 

df_train_n = _get_traditional_label_n(flow_train_data_n)

df_train_s = _get_traditional_label_s(flow_train_data_s)
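
The effect of the two shift directions is easiest to see on a toy frame. The sketch below uses made-up values and hypothetical column names (`dwell_fwd_2`, `dwell_bwd_2`), and assumes rows are sorted by date within each district, as the functions above also require.

```python
import pandas as pd

toy = pd.DataFrame({
    "district_code": ["a"] * 4 + ["b"] * 4,
    "ranks": [0, 1, 2, 3] * 2,
    "dwell": [10.0, 11.0, 12.0, 13.0, 20.0, 21.0, 22.0, 23.0],
})

grp = toy.groupby("district_code")["dwell"]
toy["dwell_fwd_2"] = grp.shift(-2)  # forward label: value 2 days later
toy["dwell_bwd_2"] = grp.shift(2)   # backward label: value 2 days earlier

print(toy)
```

Rows whose label is NaN are exactly the rows with no known target: the last two days of each district for the forward label, the first two days for the backward label. The training loops in section 5.3 use this null/non-null split to separate training rows from the rows to predict.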

4  Feature Engineering

4.1  The most recent three days

from functools import partial

from tqdm import tqdm

#######  Equivalent to computing features over the most recent three days  ##########

def _get_last3_days_feature_n(df): 

    grp_col = 'district_code'

    

    ######  Precompute the shifted columns once to save time #############

    for fea_col in tqdm(['flow_in','flow_out','dwell']): 

        for i in range(1,1+28):

            df[fea_col + '_after_{}'.format(i)]   =  df.groupby(grp_col)[fea_col].shift(-1* i).values 

    

    

    recent_features = pd.DataFrame()

    recent_features[grp_col] = df[grp_col].values

    recent_features['ranks'] = df['ranks'].values

    

    ### Features from the three nearest days (for the backward model, the days that follow) #######

    for fea_col in tqdm(['flow_in','flow_out','dwell']): 

        # Nearest values #

        recent_features[fea_col +'_last_1']   = df[fea_col + '_after_1'].values 

        recent_features[fea_col +'_last_2']   = df[fea_col + '_after_2'].values 

        recent_features[fea_col +'_last_3']   = df[fea_col + '_after_3'].values 

         

        recent_features[fea_col + '_mean_3']     =  (df[fea_col].values + df[fea_col + '_after_1'].values + df[fea_col + '_after_2'].values) / 3.0

        

        # Change features #

        recent_features[fea_col + '_diff_1']     =  df[fea_col].values -  df[fea_col + '_after_1'].values

        recent_features[fea_col + '_diff_2']     =  df[fea_col + '_after_1'].values -  df[fea_col + '_after_2'].values

        recent_features[fea_col + '_diff_diff']  =  recent_features[fea_col + '_diff_1'].values -  recent_features[fea_col + '_diff_2'].values

        recent_features[fea_col + '_divide_1']   =  df[fea_col].values / (df[fea_col + '_after_1'].values + 1e-5)  

        recent_features[fea_col + '_divide']     =  df[fea_col].values / (df[fea_col + '_after_1'].values + 1e-5) 

    return df,recent_features

def _get_last3_days_feature_s(df): 

    grp_col = 'district_code'

    

    ######  Precompute the shifted columns once to save time #############

    for fea_col in tqdm(['flow_in','flow_out','dwell']): 

        for i in range(1,1+28):

            df[fea_col + '_before_{}'.format(i)]   =  df.groupby(grp_col)[fea_col].shift(i).values 

    

    

    recent_features = pd.DataFrame()

    recent_features[grp_col] = df[grp_col].values

    recent_features['ranks'] = df['ranks'].values

    

    ### Features from the past three days #######

    for fea_col in tqdm(['flow_in','flow_out','dwell']): 

        # Past values #

        recent_features[fea_col +'_last_1']   = df[fea_col + '_before_1'].values 

        recent_features[fea_col +'_last_2']   = df[fea_col + '_before_2'].values 

        recent_features[fea_col +'_last_3']   = df[fea_col + '_before_3'].values 

        

        recent_features[fea_col + '_mean_3']     =  (df[fea_col].values + df[fea_col + '_before_1'].values + df[fea_col + '_before_2'].values) / 3.0

        

        # Change features #

        recent_features[fea_col + '_diff_1']     =  df[fea_col].values -  df[fea_col + '_before_1'].values

        recent_features[fea_col + '_diff_2']     =  df[fea_col + '_before_1'].values -  df[fea_col + '_before_2'].values

        recent_features[fea_col + '_diff_diff']  =  recent_features[fea_col + '_diff_1'].values -  recent_features[fea_col + '_diff_2'].values

        recent_features[fea_col + '_divide_1']   =  df[fea_col].values / (df[fea_col + '_before_1'].values + 1e-5)  

        recent_features[fea_col + '_divide']     =  df[fea_col].values / (df[fea_col + '_before_1'].values + 1e-5) 

    return df,recent_features

df_train_expand_n,recent_features_n = _get_last3_days_feature_n(df_train_n)

df_train_expand_s,recent_features_s = _get_last3_days_feature_s(df_train_s)
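
As a worked example of the change features, the sketch below recomputes `diff_1`, `diff_2`, `diff_diff`, and `divide_1` for one made-up district in the forward direction. The column names are shortened and the values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"district_code": ["a"] * 4,
                    "dwell": [10.0, 12.0, 15.0, 19.0]})

g = toy.groupby("district_code")["dwell"]
toy["before_1"] = g.shift(1)  # value one day earlier
toy["before_2"] = g.shift(2)  # value two days earlier

toy["diff_1"] = toy["dwell"] - toy["before_1"]             # latest daily change
toy["diff_2"] = toy["before_1"] - toy["before_2"]          # previous daily change
toy["diff_diff"] = toy["diff_1"] - toy["diff_2"]           # change of the change
toy["divide_1"] = toy["dwell"] / (toy["before_1"] + 1e-5)  # day-over-day ratio

print(toy)
```

On the last row the series went 12 -> 15 -> 19, so `diff_2` is 3, `diff_1` is 4, and `diff_diff` is 1. The first rows stay NaN because the required history does not exist there.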

4.2  Traditional statistical features

from scipy.stats import skew

from tsfresh.feature_extraction import feature_calculators as ts

from tsfresh.feature_extraction import extract_features

from numba import jit 

def get_quantile(x, percentiles = [0,0.1, 0.2, 0.3,0.4,0.5, 0.6,0.7,0.8,0.9,1]):

    x_len = len(x)

    x = np.sort(x)

    sts_feas = []  

    for per_ in percentiles:

        if per_ == 1:

            sts_feas.append(x[x_len - 1]) 

        else:

            sts_feas.append(x[int(x_len * per_)]) 

    return sts_feas

def _get_sts_features_s(data):

    #### 1.object_id:count ####

    grp_col    = 'district_code' 

    city_code  = 'city_code'

    df = pd.DataFrame()

    df[grp_col]   = data[grp_col].values  

    df[city_code] = data[city_code].values  

    df['ranks']   = data['ranks'].values 

    

    ### Statistical features over the past N days ####

    for fea_col in tqdm(['flow_in','flow_out','dwell']): 

        for slide_windows in [3,6,13,20,27]:

            print(fea_col, slide_windows)

            

            slide_cols = [ fea_col + '_before_'+ str(i+1) for i in range(slide_windows)]

            slide_cols.append(fea_col)

            

            df_tmp = data[slide_cols].values

            df_tmp_percent =  data[slide_cols].copy()

            

            df_tmp_percent['sum_'] = df_tmp_percent.sum(axis=1).values  

            

            for col in slide_cols:

                df_tmp_percent[col] = df_tmp_percent[col].values / (1e-5 + df_tmp_percent['sum_'].values) # share of the window total

            df_tmp_percent = df_tmp_percent[slide_cols].values

             

             

            # City-level column totals (not used further below); groupby sum takes no axis argument in current pandas.
            df_grp = data.groupby(city_code)[slide_cols].sum().reset_index()

            df_city_dic = data.groupby(city_code)[slide_cols].sum().to_dict()

        

            df['district_' + fea_col + '_last{}_sum'.format(slide_windows)]    =  np.sum(df_tmp,axis=1)

            df['district_' + fea_col + '_last{}_median'.format(slide_windows)] =  np.median(df_tmp,axis=1)

            df['district_' + fea_col + '_last{}_std'.format(slide_windows)]    =  np.std(df_tmp,axis=1)

            df['district_' + fea_col + '_last{}_min'.format(slide_windows)]    =  np.min(df_tmp,axis=1)

            df['district_' + fea_col + '_last{}_max'.format(slide_windows)]    =  np.max(df_tmp,axis=1)

            df['district_' + fea_col + '_last{}_mean_change'.format(slide_windows)] = data[slide_cols].apply(ts.mean_change,axis=1)

             

            df['district_percent_' + fea_col + '_last{}_median'.format(slide_windows)] =  np.median(df_tmp_percent,axis=1)

            df['district_percent_' + fea_col + '_last{}_std'.format(slide_windows)]    =  np.std(df_tmp_percent,axis=1)

            df['district_percent_' + fea_col + '_last{}_min'.format(slide_windows)]    =  np.min(df_tmp_percent,axis=1)

            df['district_percent_' + fea_col + '_last{}_max'.format(slide_windows)]    =  np.max(df_tmp_percent,axis=1) 

            

            df['district_percent_' + fea_col + '_last{}_skew'.format(slide_windows)]   =  skew(df_tmp_percent,axis=1)

  

    df['flow_in_last_week']  = data['flow_in_before_7'].values

    df['flow_out_last_week'] = data['flow_out_before_7'].values

    df['dwell_last_week']    = data['dwell_before_7'].values  

    

    for fea_col in tqdm(['flow_in','flow_out','dwell']): 

        for slide_windows in range(1,4):

            print(fea_col, slide_windows)

            

            slide_cols = [ fea_col + '_before_'+ str(i * 7 + 7)  for i in range(slide_windows)]

            slide_cols.append(fea_col)

            df_tmp = data[slide_cols].values 

                

            if slide_windows == 1:

                df[fea_col + '_w{}_mean'.format(slide_windows)]   =  np.mean(df_tmp,axis=1)

            else:

                df[fea_col + '_w{}_mean'.format(slide_windows)]   =  np.mean(df_tmp,axis=1)

                df[fea_col + '_w{}_median'.format(slide_windows)] =  np.median(df_tmp,axis=1)

                df[fea_col + '_w{}_std'.format(slide_windows)]    =  np.std(df_tmp,axis=1)

                df[fea_col + '_w{}_mean_change'.format(slide_windows)] = data[slide_cols].apply(ts.mean_change,axis=1) 

    return df

def _get_sts_features_n(data):

    #### 1.object_id:count ####

    grp_col = 'district_code' 

    city_code  = 'city_code'

    df = pd.DataFrame()

    df[grp_col] = data[grp_col].values  

    df[city_code] = data[city_code].values  

    df['ranks'] = data['ranks'].values 

    

    ### Statistical features over the N nearest days (for the backward model, the days that follow) ####

    for fea_col in tqdm(['flow_in','flow_out','dwell']): 

        for slide_windows in [3,6,13,20,27]:

            print(fea_col, slide_windows)

            

            slide_cols = [ fea_col + '_after_'+ str(i+1) for i in range(slide_windows)]

            slide_cols.append(fea_col)

            

            df_tmp = data[slide_cols].values

            df_tmp_percent =  data[slide_cols].copy()

            

            df_tmp_percent['sum_'] = df_tmp_percent.sum(axis=1).values  

            for col in slide_cols:

                df_tmp_percent[col] = df_tmp_percent[col].values / (1e-5 + df_tmp_percent['sum_'].values) # share of the window total

            df_tmp_percent = df_tmp_percent[slide_cols].values

             

             

            # City-level column totals (not used further below); groupby sum takes no axis argument in current pandas.
            df_grp = data.groupby(city_code)[slide_cols].sum().reset_index()

            df_city_dic = data.groupby(city_code)[slide_cols].sum().to_dict()

        

            df['district_' + fea_col + '_last{}_sum'.format(slide_windows)]    =  np.sum(df_tmp,axis=1)

            df['district_' + fea_col + '_last{}_median'.format(slide_windows)] =  np.median(df_tmp,axis=1)

            df['district_' + fea_col + '_last{}_std'.format(slide_windows)]    =  np.std(df_tmp,axis=1)

            df['district_' + fea_col + '_last{}_min'.format(slide_windows)]    =  np.min(df_tmp,axis=1)

            df['district_' + fea_col + '_last{}_max'.format(slide_windows)]    =  np.max(df_tmp,axis=1)

            df['district_' + fea_col + '_last{}_mean_change'.format(slide_windows)] = data[slide_cols].apply(ts.mean_change,axis=1)

            df['district_' + fea_col + '_last{}_skew'.format(slide_windows)]   =  skew(df_tmp,axis=1)

             

            df['district_percent_' + fea_col + '_last{}_median'.format(slide_windows)] =  np.median(df_tmp_percent,axis=1)

            df['district_percent_' + fea_col + '_last{}_std'.format(slide_windows)]    =  np.std(df_tmp_percent,axis=1)

            df['district_percent_' + fea_col + '_last{}_min'.format(slide_windows)]    =  np.min(df_tmp_percent,axis=1)

            df['district_percent_' + fea_col + '_last{}_max'.format(slide_windows)]    =  np.max(df_tmp_percent,axis=1) 

            df['district_percent_' + fea_col + '_last{}_skew'.format(slide_windows)]   =  skew(df_tmp_percent,axis=1)

               

    df['flow_in_last_week']  = data['flow_in_after_7'].values

    df['flow_out_last_week'] = data['flow_out_after_7'].values

    df['dwell_last_week']    = data['dwell_after_7'].values  

    

    for fea_col in tqdm(['flow_in','flow_out','dwell']): 

        for slide_windows in range(1,4):

            print(fea_col, slide_windows)

            

            slide_cols = [ fea_col + '_after_'+ str(i * 7 + 7)  for i in range(slide_windows)]

            slide_cols.append(fea_col)

            df_tmp = data[slide_cols].values 

                

            if slide_windows == 1:

                df[fea_col + '_w{}_mean'.format(slide_windows)]   =  np.mean(df_tmp,axis=1)

            else:

                df[fea_col + '_w{}_mean'.format(slide_windows)]   =  np.mean(df_tmp,axis=1)

                df[fea_col + '_w{}_median'.format(slide_windows)] =  np.median(df_tmp,axis=1)

                df[fea_col + '_w{}_std'.format(slide_windows)]    =  np.std(df_tmp,axis=1)

                df[fea_col + '_w{}_mean_change'.format(slide_windows)] = data[slide_cols].apply(ts.mean_change,axis=1) 

    return df

%%time

sts_features_n = _get_sts_features_n(df_train_expand_n)


         

%%time

sts_features_s = _get_sts_features_s(df_train_expand_s)
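
The statistical features above are row-wise reductions over the current value plus its precomputed lag columns. A minimal sketch on a made-up series shows that this reproduces an ordinary rolling window. `pandas.Series.rolling` is used here only as a cross-check and is not part of the original code.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
window = 3  # number of lags; the window spans window + 1 days including "today"

# Columns: today, 1 day earlier, 2 days earlier, 3 days earlier.
lags = pd.concat([s.shift(i) for i in range(window + 1)], axis=1)

lag_sum = np.sum(lags.values, axis=1)
lag_max = np.max(lags.values, axis=1)

# Cross-check against a rolling window of the same length.
roll_sum = s.rolling(window + 1).sum().values
print(np.allclose(lag_sum, roll_sum, equal_nan=True))  # True
```

Rows without a full history come out as NaN in both versions. The lag-matrix layout has the practical advantage that one set of shifted columns can feed many window lengths and many statistics without regrouping the data.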

         

5  Model Validation, Training & Testing

5.1  Error function

# Error function (RMSLE: root mean squared error between log1p values)

def get_error(y_pred, y_true):

    return np.sqrt((np.mean(np.power(np.log1p(y_pred) - np.log1p(y_true),2)))) 
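
A short numeric check, with hypothetical values, illustrates what this metric measures: it penalizes relative rather than absolute error, so a uniform 10% overestimate scores about ln(1.1) whether the true count is one thousand or one hundred thousand.

```python
import numpy as np


def get_error(y_pred, y_true):
    """RMSLE, same formula as in the article."""
    return np.sqrt(np.mean(np.power(np.log1p(y_pred) - np.log1p(y_true), 2)))


# Hypothetical values: every prediction is 10% too high, at two very different scales.
y_true = np.array([1_000.0, 100_000.0])
y_pred = 1.1 * y_true

err = get_error(y_pred, y_true)
print(err, np.log(1.1))  # both close to 0.095

# expm1 inverts log1p, so predictions made in log space can be mapped back.
assert np.allclose(np.expm1(np.log1p(y_true)), y_true)
```

This is why the training loops below fit the models on `np.log1p` of the targets with an RMSE objective and apply `np.expm1` to the predictions: RMSE in log1p space is the same quantity as this error on the raw scale.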

5.2  Baseline LGB model

from sklearn.model_selection import StratifiedKFold

def _get_lgb_models_test(train, train_label, n_estimators = 350):

    lgb_params = {

        'boosting_type': 'gbdt',

        'objective': 'regression', 

        'metric': 'rmse',

        'learning_rate': 0.05, 

        'subsample': 0.9,

        'colsample_bytree': 0.9,

        'n_estimators': n_estimators,

        'silent': -1,

        'verbose': -1,

        'max_depth': 3

    }

     

    model = lgb.LGBMRegressor(**lgb_params)

    model.fit(

        train, train_label,

        eval_set=[(train, train_label)], 

        verbose=50,

        eval_metric = 'rmse',

        early_stopping_rounds=250)      

    return model

5.3  Model training & testing

5.3.1  Backward training (later segment, suffix _n)

grp_col  = 'district_code'

pred = []

Sub_N = None

st = 20170823

for test_day in range(1,6):

    test_day = str(test_day)

    

    #################  Backward  ################

    train_label_n = df_train_n.loc[df_train_n['dwell_' + test_day].isnull() == False][['dwell_' + test_day,'flow_in_' + test_day,'flow_out_' + test_day,'month','day','day_of_week','dwell','flow_in','flow_out',grp_col,'ranks']]

    val_label_n   = df_train_n.loc[df_train_n['dwell_' + test_day].isnull() == True][['dwell_' + test_day,'flow_in_' + test_day,'flow_out_' + test_day,'month','day','day_of_week','dwell','flow_in','flow_out',grp_col,'ranks']]

    

    train_label_n = train_label_n.loc[train_label_n.ranks < train_label_n['ranks'].min() +  8] 

    train_label_n = train_label_n.merge(sts_features_n,on=['district_code','ranks'], how= 'left')

    train_label_n = train_label_n.merge(recent_features_n,on=['district_code','ranks'], how= 'left') 

    

    val_label_n = val_label_n.merge(sts_features_n,on=['district_code','ranks'], how= 'left')

    val_label_n = val_label_n.merge(recent_features_n,on=['district_code','ranks'], how= 'left')

    

    

    train_cols = [col for col in train_label_n.columns if 'dwell_'+ test_day not in col and 'flow_in_'+ test_day not in col and 'flow_out_'+ test_day not in col\

              and 'ranks' not in col and train_label_n[col].dtype!='O' and 'month' not in col and 'year' not in col] 

    

    print('model training')

    model_dwell    =  _get_lgb_models_test(train_label_n[train_cols], train_label_n['dwell_'+ test_day].apply(np.log1p).values,n_estimators=350)

    model_flowin   =  _get_lgb_models_test(train_label_n[train_cols], train_label_n['flow_in_'+ test_day].apply(np.log1p).values,n_estimators=350)

    model_flowout  =  _get_lgb_models_test(train_label_n[train_cols], train_label_n['flow_out_'+ test_day].apply(np.log1p).values,n_estimators=350)

    

    

    val_label_n['dwell_'+ test_day+'_predict']    = model_dwell.predict(val_label_n[train_cols])

    val_label_n['flow_in_'+ test_day+'_predict']  = model_flowin.predict(val_label_n[train_cols])

    val_label_n['flow_out_'+ test_day+'_predict'] = model_flowout.predict(val_label_n[train_cols])

    

    # Take the earliest row per district: its test_day-days-earlier target is the date assigned below.
    dwell_pred_dict    = val_label_n.groupby(grp_col)['dwell_'+ test_day+'_predict'].first().to_dict()

    flow_in_pred_dict  = val_label_n.groupby(grp_col)['flow_in_'+ test_day+'_predict'].first().to_dict()

    flow_out_pred_dict = val_label_n.groupby(grp_col)['flow_out_'+ test_day+'_predict'].first().to_dict()

    

    val_label_n['date_dt'] = st + 1 - int(test_day)

    submit_n = val_label_n[['district_code','date_dt']].copy()

    submit_n = submit_n.drop_duplicates(subset = ['district_code','date_dt']) 

    

    submit_n['dwell']   =  np.expm1(submit_n['district_code'].map(dwell_pred_dict))

    submit_n['flow_in'] =  np.expm1(submit_n['district_code'].map(flow_in_pred_dict))

    submit_n['flow_out'] = np.expm1(submit_n['district_code'].map(flow_out_pred_dict))

    if test_day == '1':

        Sub_N = submit_n

    else:

        Sub_N = pd.concat([Sub_N, submit_n],ignore_index=True)

    print(Sub_N.shape)

      

5.3.2  Forward training (earlier segment, suffix _s)

preds2 = []

grp_col  = 'district_code'

Sub_S = None

st = 20170819

for test_day in range(1,6):

    test_day = str(test_day)

    train_label_s = df_train_s.loc[df_train_s['dwell_' + test_day].isnull() == False][['dwell_' + test_day,'flow_in_' + test_day,'flow_out_' + test_day,'month','day','day_of_week','dwell','flow_in','flow_out',grp_col,'ranks']]

    val_label_s   = df_train_s.loc[df_train_s['dwell_' + test_day].isnull() == True][['dwell_' + test_day,'flow_in_' + test_day,'flow_out_' + test_day,'month','day','day_of_week','dwell','flow_in','flow_out',grp_col,'ranks']]

     

    train_label_s = train_label_s.loc[train_label_s.ranks >= train_label_s['ranks'].max() -  7] 

    train_label_s = train_label_s.merge(sts_features_s,   on=['district_code','ranks'], how= 'left')

    train_label_s = train_label_s.merge(recent_features_s,on=['district_code','ranks'], how= 'left')

    

    val_label_s = val_label_s.merge(sts_features_s,on=['district_code','ranks'], how= 'left')

    val_label_s = val_label_s.merge(recent_features_s,on=['district_code','ranks'], how= 'left') 

    

    train_cols = [col for col in train_label_s.columns if 'dwell_'+ test_day not in col and 'flow_in_'+ test_day not in col and 'flow_out_'+ test_day not in col\

              and 'ranks' not in col and train_label_s[col].dtype!='O' and 'year' not in col] 

    

    print('model training')

    model_dwell    =  _get_lgb_models_test(train_label_s[train_cols], train_label_s['dwell_'+ test_day].apply(np.log1p).values,n_estimators=350)

    model_flowin   =  _get_lgb_models_test(train_label_s[train_cols], train_label_s['flow_in_'+ test_day].apply(np.log1p).values,n_estimators=350)

    model_flowout  =  _get_lgb_models_test(train_label_s[train_cols], train_label_s['flow_out_'+ test_day].apply(np.log1p).values,n_estimators=350)

    

    

    val_label_s['dwell_'+ test_day+'_predict']    = model_dwell.predict(val_label_s[train_cols])

    val_label_s['flow_in_'+ test_day+'_predict']  = model_flowin.predict(val_label_s[train_cols])

    val_label_s['flow_out_'+ test_day+'_predict'] = model_flowout.predict(val_label_s[train_cols])

    

    dwell_pred_dict    = val_label_s.groupby(grp_col)['dwell_'+ test_day+'_predict'].last().to_dict()

    flow_in_pred_dict  = val_label_s.groupby(grp_col)['flow_in_'+ test_day+'_predict'].last().to_dict()

    flow_out_pred_dict = val_label_s.groupby(grp_col)['flow_out_'+ test_day+'_predict'].last().to_dict()

    

    val_label_s['date_dt'] = st - 1 + int(test_day) 

    submit_s = val_label_s[['district_code','date_dt']].copy()

    submit_s = submit_s.drop_duplicates(subset = ['district_code','date_dt']) 

    submit_s['dwell']   =  np.expm1(submit_s['district_code'].map(dwell_pred_dict))

    submit_s['flow_in'] =  np.expm1(submit_s['district_code'].map(flow_in_pred_dict))

    submit_s['flow_out'] = np.expm1(submit_s['district_code'].map(flow_out_pred_dict))

    if test_day == '1':

        Sub_S = submit_s

    else:

        Sub_S = pd.concat([Sub_S, submit_s],ignore_index=True)

        

    print(Sub_S.shape)
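
Both loops compute the output date by adding to or subtracting from a YYYYMMDD integer (`st - 1 + int(test_day)` and `st + 1 - int(test_day)`). That works here because all five days fall inside August, but it would produce invalid dates across a month boundary. A minimal sketch of a safer helper follows; `shift_yyyymmdd` is a hypothetical name, not part of the original code.

```python
from datetime import datetime, timedelta


def shift_yyyymmdd(date_int: int, days: int) -> int:
    """Shift a YYYYMMDD integer by a number of calendar days."""
    shifted = datetime.strptime(str(date_int), "%Y%m%d") + timedelta(days=days)
    return int(shifted.strftime("%Y%m%d"))


# Dates written by the forward loop (st = 20170819) and the backward loop (st = 20170823).
forward_dates = [shift_yyyymmdd(20170819, k - 1) for k in range(1, 6)]
backward_dates = [shift_yyyymmdd(20170823, 1 - k) for k in range(1, 6)]
print(forward_dates)   # [20170819, 20170820, 20170821, 20170822, 20170823]
print(backward_dates)  # [20170823, 20170822, 20170821, 20170820, 20170819]

# Across a month boundary, plain integer addition would give 20170833.
print(shift_yyyymmdd(20170830, 3))  # 20170902
```

Inside the gap the two approaches agree, so this only matters if the same loops are reused for a window that crosses the end of a month.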

     

5.4  Format conversion for the final submission

from copy import deepcopy

submit_8month = None

submit_8month = deepcopy(Sub_S)

submit_8month = submit_8month.sort_values(['district_code','date_dt'])

Sub_N = Sub_N.sort_values(['district_code','date_dt'])

district_2_city_dic = flow_train.groupby(['district_code'])['city_code'].last().to_dict()

submit_8month['city_code'] = submit_8month['district_code'].map(district_2_city_dic)

5.5  Forward/backward fusion

weights = [0,0.1,0.5,0.9,1]

st = 20170819

for i in range(5):

    s_weight = 1 - weights[i]

    n_weight = weights[i]

    print( st +i)

    submit_8month.loc[submit_8month['date_dt'] == st +i,'dwell']    = submit_8month.loc[submit_8month['date_dt'] == st +i,'dwell'].values    * s_weight + Sub_N.loc[Sub_N['date_dt'] == st +i,'dwell'].values   * n_weight

    submit_8month.loc[submit_8month['date_dt'] == st +i,'flow_in']  = submit_8month.loc[submit_8month['date_dt'] == st +i,'flow_in'].values  * s_weight + Sub_N.loc[Sub_N['date_dt'] == st +i,'flow_in'].values * n_weight

    submit_8month.loc[submit_8month['date_dt'] == st +i,'flow_out'] = submit_8month.loc[submit_8month['date_dt'] == st +i,'flow_out'].values * s_weight + Sub_N.loc[Sub_N['date_dt'] == st +i,'flow_out'].values* n_weight    

submit_8month.to_csv('sub8.csv',index = None)
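
The fusion above combines the two prediction sets position by position through `.values`, so it depends on both frames being sorted the same way (which the `sort_values` calls in section 5.4 take care of). As an alternative sketch, the same weighting can be done with a key-based merge that does not depend on row order. The frames `fwd` and `bwd` and their values are hypothetical.

```python
import pandas as pd

weights = [0, 0.1, 0.5, 0.9, 1]  # weight of the backward prediction, day by day
dates = [20170819 + i for i in range(5)]

# Toy predictions for one district; bwd is deliberately stored in reverse order.
fwd = pd.DataFrame({"district_code": ["a"] * 5, "date_dt": dates,
                    "dwell": [100.0] * 5})
bwd = pd.DataFrame({"district_code": ["a"] * 5, "date_dt": dates[::-1],
                    "dwell": [200.0] * 5})

# Align on keys instead of on row position.
merged = fwd.merge(bwd, on=["district_code", "date_dt"], suffixes=("_s", "_n"))
w = merged["date_dt"].map(dict(zip(dates, weights)))
merged["dwell"] = merged["dwell_s"] * (1 - w) + merged["dwell_n"] * w

print(merged[["date_dt", "dwell"]])
```

The weights express the idea behind the scheme: the first missing day sits right after the earlier segment, so the forward model is trusted fully there, and trust moves to the backward model as the date approaches the later segment.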

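   This article contains many score-boosting techniques; readers are encouraged to try them hands-on and find out which module brings the largest gain.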

Published: 2024-02-26 23:09:25
Article link: https://www.elefans.com/category/jswz/34/1704325.html