[Time Series] JDD Population Dynamics Census and Forecasting, 2018 (Forward and Backward Time-Series Modeling)


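This article is about time-series processing, using the JDD Population Dynamics Census and Forecasting competition as the example. It has far fewer comments than usual and is heavy on code. Why each step is done the way it is, and whether it risks overfitting, is left for readers to work out by studying the code.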

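Outline:
1 Competition overview
  - 1.1 Preface
  - 1.2 Background
  - 1.3 Task
2 Packages & data import
  - 2.1 Importing packages
  - 2.2 Importing data
3 Train/validation split
  - 3.1 Data preparation
    - 3.1.1 Converting dates to ranks
  - 3.2 Train/test split
  - 3.3 Building labels
    - 3.3.1 Shifting by N days to build labels quickly
4 Feature engineering
  - 4.1 The most recent three days
  - 4.2 Traditional statistical features
5 Model validation, training & testing
  - 5.1 Error function
  - 5.2 A basic LGB model
  - 5.3 Model training & testing
    - 5.3.1 Forward training
    - 5.3.2 Backward training
  - 5.4 Format conversion for the final submission
  - 5.5 Blending forward and backward predictions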

1 Competition overview

Competition link:
=3dca1a91ad2a4a6da201f125ede9601a

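1.1 Preface

This notebook presents a traditional modeling strategy for short-horizon time series. The strategy predicts well over short horizons, and the example used here is the recently concluded JDD Population Dynamics Census and Forecasting competition. Every piece of feature engineering below maps onto one of the five parts of my earlier time-series essentials series. Most of the modeling ideas are similar to those used in the Kaggle competition. The difference is that I switched to a different data representation, which speeds up feature extraction and related steps. The details are left for readers to work out.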

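1.2 Background

A census is one of the most basic survey methods by which a government obtains population data in each period and keeps track of national conditions and strength. It is also extremely time-consuming and labor-intensive: since the founding of the People's Republic, China has carried out only six national censuses. In an era of explosive data growth and rapid progress in data technology, using artificial intelligence and big data to estimate urban population can make census work more efficient, save a great deal of time and labor, and may even make real-time, dynamic population forecasting possible.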

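1.3 Task

All competition data is simulated. Participants are asked to build a forecasting model from simulated data covering several neighboring cities: the historical number of mobile-device users, the transfer of users between districts and counties, and the share of mobile-device users within each district or county (provided in the final stage). The model must forecast the change in total population of every district and county in those cities over the next 15 days. The competition assumes that one device corresponds to exactly one person. In the preliminary stage, total population is counted as the number of mobile-device users. In the final stage, it is counted as the number of mobile-device users divided by the mobile-device user share.

For the regional final stage, a new dataset was issued. The data source, time range, and locations all changed, while the data format stayed the same as in the preliminary stage. In the training data, 5 days in the middle were removed to simulate missing data. Participants must submit values for those 5 missing days and also forecast the following 10 days. Whether the preliminary-stage data is still worth using is left for participants to judge.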
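The two counting rules above can be sketched in a few lines. This is only an illustration: the function name `total_population` and the numbers are hypothetical and do not come from the competition data.

```python
def total_population(device_users, user_share=None):
    """Total population under the two counting rules described above.

    Preliminary stage: population = number of device users.
    Final stage:       population = device users / device user share.
    """
    if user_share is None:
        return float(device_users)
    if not 0.0 < user_share <= 1.0:
        raise ValueError("user_share must be in (0, 1]")
    return device_users / user_share


# Hypothetical district with 600 device users and a 50% device share.
prelim = total_population(600)
final = total_population(600, user_share=0.5)
print(prelim, final)
```

With a share below 1, the final-stage figure is always larger than the raw device count, so errors in the share scale directly into the population estimate.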

2 Packages & data import

2.1 Importing packages

## Data packages
import numpy as np
np.random.seed(42)
import pandas as pd
from tqdm import tqdm

## String-processing packages
import string
import re
import gensim
from collections import Counter
import pickle
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

from keras.preprocessing import text, sequence

import tensorflow as tf
# The next two lines exist only in TensorFlow 1.x (removed in 2.x)
tf.enable_eager_execution()
tfe = tf.contrib.eager

import warnings
warnings.filterwarnings('ignore')

import xgboost as xgb
import lightgbm as lgb

from functools import partial
import os
import gc
from scipy.sparse import vstack
import time
import multiprocessing as mp

import seaborn as sns
# Jupyter magic; only valid inside a notebook
%matplotlib inline

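2.2 Importing data

flow_train = pd.read_csv('./data/flow_train.csv')

# date_dt is a yyyymmdd integer, so 20170819 + 5 is 20170824 (same month, no rollover).
# .copy() avoids SettingWithCopyWarning when columns are added to these slices later.
# flow_train_s is labelled below with values from later days (forward model),
# flow_train_n with values from earlier days (backward model).
flow_train_s = flow_train.loc[(flow_train.date_dt >= 20170523) & (flow_train.date_dt < 20170819 + 5)].copy()
flow_train_n = flow_train.loc[(flow_train.date_dt >= 20170819) & (flow_train.date_dt <= 20170930)].copy()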
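The filter above adds 5 directly to a yyyymmdd integer, which works here because 19 + 5 stays inside August. A minimal sketch of the same cut-off done on real dates, which also holds across a month boundary, is shown below. The toy frame and its values are made up for illustration; only the column name `date_dt` comes from the code above.

```python
import pandas as pd

# Toy frame with yyyymmdd integers, mirroring the date_dt column
toy = pd.DataFrame({"date_dt": [20170818, 20170823, 20170824, 20170829, 20170901]})

# Parse the integers into real dates before doing day arithmetic
dates = pd.to_datetime(toy["date_dt"].astype(str), format="%Y%m%d")
cutoff = pd.Timestamp("2017-08-19") + pd.Timedelta(days=5)   # 2017-08-24

before_cutoff = toy.loc[dates < cutoff, "date_dt"].tolist()
int_version = toy.loc[toy.date_dt < 20170819 + 5, "date_dt"].tolist()

# Across a month boundary the integer trick breaks: 20170829 + 5 is 20170834,
# which is not a date, while Timedelta arithmetic rolls over to September.
rollover_key = int((pd.Timestamp("2017-08-29") + pd.Timedelta(days=5)).strftime("%Y%m%d"))
print(before_cutoff, int_version, rollover_key)
```

Both filters agree on this window. The date-based version is the safer choice if the window is ever moved near the end of a month.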

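3 Train/validation split

To guard against overfitting, split off the training and test sets first.

3.1 Data preparation

Extract the year, month, day, and weekday, and convert dates to ranks so that labels are easier to build.

3.1.1 Converting dates to ranks for later modeling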

import datetime

class Date_Process:
    def __init__(self):
        self.rank_dic = {}

    def _dateinfo_trans(self, df):
        # Parse the yyyymmdd integer into a datetime, then split out the calendar fields
        df['date_dt'] = df['date_dt'].apply(lambda x: datetime.datetime.strptime(str(x), '%Y%m%d'))
        df['year'] = df['date_dt'].map(lambda x: x.year)
        df['month'] = df['date_dt'].map(lambda x: x.month)
        df['day'] = df['date_dt'].map(lambda x: x.day)
        df['day_of_week'] = df['date_dt'].map(lambda x: x.weekday())
        return df

    def _ranks_fit_transform(self, df):
        # Sortable date key. It preserves date order only within one calendar year,
        # which holds here because all rows are from 2017.
        df['ranks'] = df['year'] * 400 + df['month'] * 40 + df['day']
        rank_sort = np.sort(df['ranks'].unique())
        # Map each distinct key to its position in sorted order: 0, 1, 2, ...
        rank_dic = {}
        for i, val in enumerate(rank_sort):
            rank_dic[val] = i
        df['ranks'] = df['ranks'].map(rank_dic)
        self.rank_dic = rank_dic
        return df

    def _ranks_transform(self, df):
        df['ranks'] = df['year'] * 400 + df['month'] * 40 + df['day']
        df['ranks'] = df['ranks'].map(self.rank_dic)
        # map() does not raise on unseen keys; it yields NaN, so check explicitly
        if df['ranks'].isna().any():
            print('Date not in the same range!')
        return df

date_process_s = Date_Process()
flow_train_s = date_process_s._dateinfo_trans(flow_train_s)
flow_train_s = date_process_s._ranks_fit_transform(flow_train_s)

date_process_n = Date_Process()
flow_train_n = date_process_n._dateinfo_trans(flow_train_n)
flow_train_n = date_process_n._ranks_fit_transform(flow_train_n)
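The rank step maps each distinct date to its position in sorted order. The key `year * 400 + month * 40 + day` is enough for data that stays inside one calendar year, but it can collide across years. The sketch below shows one such collision and a variant that ranks parsed dates directly. The helper names `composite_key` and `fit_date_ranks` are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def composite_key(year, month, day):
    """The sortable key used in this section."""
    return year * 400 + month * 40 + day


def fit_date_ranks(date_ints):
    """Map each distinct yyyymmdd date to its position in sorted order (0, 1, 2, ...)."""
    parsed = pd.to_datetime(pd.Series(date_ints).astype(str), format="%Y%m%d")
    ordered = parsed.drop_duplicates().sort_values()
    return {d: i for i, d in enumerate(ordered)}


# 2017-11-01 and 2018-01-01 receive the same composite key
collision = composite_key(2017, 11, 1) == composite_key(2018, 1, 1)

# Ranking the parsed dates keeps them distinct and in order
ranks = fit_date_ranks([20171101, 20171231, 20180101, 20171231])
print(collision, ranks)
```

Either way, the rank is a position among the dates that are present, not a day count, so any missing days in the data are silently compressed.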

3.2 Train/test split

Note that the commented-out part below is intended for validation.

flow_train_s.tail()

flow_train_data_n = flow_train_n.copy()
flow_train_data_s = flow_train_s.copy()

3.3 Building labels

3.3.1 Shifting by N days to build labels quickly

# Both functions assume rows are sorted by date within each district.

def _get_traditional_label_n(df, grp_col = 'district_code', label_cols = ['dwell','flow_in','flow_out'], day_shifts = [1,2,3,4,5,6,7,8,9,10]):
    for day in tqdm(day_shifts):
        for label_col in label_cols:
            # shift(day) moves values to later rows, so each row's label is the value
            # `day` days earlier (backward model)
            df[label_col + '_' + str(day)] = df.groupby(grp_col)[label_col].shift(day).values
    return df

def _get_traditional_label_s(df, grp_col = 'district_code', label_cols = ['dwell','flow_in','flow_out'], day_shifts = [1,2,3,4,5,6,7,8,9,10]):
    for day in tqdm(day_shifts):
        for label_col in label_cols:
            # shift(-day) moves values to earlier rows, so each row's label is the value
            # `day` days later (forward model)
            df[label_col + '_' + str(day)] = df.groupby(grp_col)[label_col].shift(-1 * day).values
    return df

df_train_n = _get_traditional_label_n(flow_train_data_n)
df_train_s = _get_traditional_label_s(flow_train_data_s)
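The direction of `shift` is easy to get backwards, so here is a minimal sketch on a toy frame. The data and the column names `dwell_fwd_1` and `dwell_bwd_1` are made up for illustration; the original functions name their label columns `dwell_1`, `dwell_2`, and so on.

```python
import pandas as pd

toy = pd.DataFrame({
    "district_code": ["a"] * 4 + ["b"] * 4,
    "ranks": [0, 1, 2, 3] * 2,
    "dwell": [10.0, 11.0, 12.0, 13.0, 20.0, 21.0, 22.0, 23.0],
})
# shift works on row order, so sort by date within each district first
toy = toy.sort_values(["district_code", "ranks"]).reset_index(drop=True)

grouped = toy.groupby("district_code")["dwell"]
toy["dwell_fwd_1"] = grouped.shift(-1)  # value one day later: forward label
toy["dwell_bwd_1"] = grouped.shift(1)   # value one day earlier: backward label
print(toy)
```

Because the shift is done per group, the last row of each district has no forward label and the first row has no backward label. Those rows come out as NaN instead of leaking values from the neighboring district.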

4 Feature engineering

4.1 The most recent three days

from functools import partial
from tqdm import tqdm

####### Equivalent to computing features over the three nearest days ##########

def _get_last3_days_feature_n(df):
    grp_col = 'district_code'

    ###### Precompute the shifted columns once to save time #############
    for fea_col in tqdm(['flow_in','flow_out','dwell']):
        for i in range(1,1+28):
            # Backward model: the value i days later within each district
            df[fea_col + '_after_{}'.format(i)] = df.groupby(grp_col)[fea_col].shift(-1* i).values

    recent_features = pd.DataFrame()
    recent_features[grp_col] = df[grp_col].values
    recent_features['ranks'] = df['ranks'].values

    ### Features from the three nearest days (for this model, the three following days) #######
    for fea_col in tqdm(['flow_in','flow_out','dwell']):
        # Lagged values #
        recent_features[fea_col +'_last_1'] = df[fea_col + '_after_1'].values
        recent_features[fea_col +'_last_2'] = df[fea_col + '_after_2'].values
        recent_features[fea_col +'_last_3'] = df[fea_col + '_after_3'].values
        # Mean of the current day and the two nearest days
        recent_features[fea_col + '_mean_3'] = (df[fea_col].values + df[fea_col + '_after_1'].values + df[fea_col + '_after_2'].values) / 3.0

        # Change features #
        recent_features[fea_col + '_diff_1'] = df[fea_col].values - df[fea_col + '_after_1'].values
        recent_features[fea_col + '_diff_2'] = df[fea_col + '_after_1'].values - df[fea_col + '_after_2'].values
        recent_features[fea_col + '_diff_diff'] = recent_features[fea_col + '_diff_1'].values - recent_features[fea_col + '_diff_2'].values
        # Ratio to the nearest day; 1e-5 guards against division by zero
        recent_features[fea_col + '_divide_1'] = df[fea_col].values / (df[fea_col + '_after_1'].values + 1e-5)
        # Same value as _divide_1
        recent_features[fea_col + '_divide'] = df[fea_col].values / (df[fea_col + '_after_1'].values + 1e-5)

    return df,recent_features

def _get_last3_days_feature_s(df):
    grp_col = 'district_code'

    ###### Precompute the shifted columns once to save time #############
    for fea_col in tqdm(['flow_in','flow_out','dwell']):
        for i in range(1,1+28):
            # Forward model: the value i days earlier within each district
            df[fea_col + '_before_{}'.format(i)] = df.groupby(grp_col)[fea_col].shift(i).values

    recent_features = pd.DataFrame()
    recent_features[grp_col] = df[grp_col].values
    recent_features['ranks'] = df['ranks'].values

    ### Features from the past three days #######
    for fea_col in tqdm(['flow_in','flow_out','dwell']):
        # Lagged values #
        recent_features[fea_col +'_last_1'] = df[fea_col + '_before_1'].values
        recent_features[fea_col +'_last_2'] = df[fea_col + '_before_2'].values
        recent_features[fea_col +'_last_3'] = df[fea_col + '_before_3'].values
        # Mean of the current day and the two previous days
        recent_features[fea_col + '_mean_3'] = (df[fea_col].values + df[fea_col + '_before_1'].values + df[fea_col + '_before_2'].values) / 3.0

        # Change features #
        recent_features[fea_col + '_diff_1'] = df[fea_col].values - df[fea_col + '_before_1'].values
        recent_features[fea_col + '_diff_2'] = df[fea_col + '_before_1'].values - df[fea_col + '_before_2'].values
        recent_features[fea_col + '_diff_diff'] = recent_features[fea_col + '_diff_1'].values - recent_features[fea_col + '_diff_2'].values
        # Ratio to the previous day; 1e-5 guards against division by zero
        recent_features[fea_col + '_divide_1'] = df[fea_col].values / (df[fea_col + '_before_1'].values + 1e-5)
        # Same value as _divide_1
        recent_features[fea_col + '_divide'] = df[fea_col].values / (df[fea_col + '_before_1'].values + 1e-5)

    return df,recent_features

df_train_expand_n,recent_features_n = _get_last3_days_feature_n(df_train_n)
df_train_expand_s,recent_features_s = _get_last3_days_feature_s(df_train_s)
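To see what the recent-day features contain, the following sketch computes a subset of them for one column on a toy series. It mirrors the forward (`_before_`) variant. The helper name `recent_features_for` and the toy values are illustrative assumptions, not the notebook's own code.

```python
import pandas as pd


def recent_features_for(df, col, grp_col="district_code"):
    """Lag, mean, difference and ratio features for one column (forward variant)."""
    g = df.groupby(grp_col)[col]
    lag1, lag2 = g.shift(1), g.shift(2)
    out = pd.DataFrame(index=df.index)
    out[col + "_last_1"] = lag1
    out[col + "_mean_3"] = (df[col] + lag1 + lag2) / 3.0
    out[col + "_diff_1"] = df[col] - lag1             # change since yesterday
    out[col + "_diff_2"] = lag1 - lag2                # change the day before
    out[col + "_diff_diff"] = out[col + "_diff_1"] - out[col + "_diff_2"]
    out[col + "_divide_1"] = df[col] / (lag1 + 1e-5)  # ratio to yesterday
    return out


toy = pd.DataFrame({"district_code": ["a"] * 4, "dwell": [10.0, 12.0, 15.0, 19.0]})
feats = recent_features_for(toy, "dwell")
print(feats)
```

The first two rows of each district contain NaN because there is not yet enough history. A gradient-boosted tree model such as LightGBM can consume those rows as they are.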

4.2 Traditional statistical features

from scipy.stats import skew
from tsfresh.feature_extraction import feature_calculators as ts
from tsfresh.feature_extraction import extract_features
from numba import jit

def get_quantile(x, percentiles = [0,0.1, 0.2, 0.3,0.4,0.5, 0.6,0.7,0.8,0.9,1]):
    # Quantiles without interpolation: take the element at index int(n * p)
    # of the sorted array; p == 1 is handled separately to stay in bounds
    x_len = len(x)
    x = np.sort(x)
    sts_feas = []
    for per_ in percentiles:
        if per_ == 1:
            sts_feas.append(x[x_len - 1])
        else:
            sts_feas.append(x[int(x_len * per_)])
    return sts_feas

def _get_sts_features_s(data):
    #### 1.object_id:count ####
    grp_col = 'district_code'
    city_code = 'city_code'

    df = pd.DataFrame()
    df[grp_col] = data[grp_col].values
    df[city_code] = data[city_code].values
    df['ranks'] = data['ranks'].values

    ### Statistics over the past N days ####
    for fea_col in tqdm(['flow_in','flow_out','dwell']):
        for slide_windows in [3,6,13,20,27]:
            print(fea_col, slide_windows)
            # Window columns in the order [before_1, ..., before_N, current]. This is not
            # chronological, which matters for order-dependent statistics such as mean_change.
            slide_cols = [ fea_col + '_before_'+ str(i+1) for i in range(slide_windows)]
            slide_cols.append(fea_col)

            df_tmp = data[slide_cols].values

            # Each day's share of the window total
            df_tmp_percent = data[slide_cols].copy()
            df_tmp_percent['sum_'] = df_tmp_percent.sum(axis=1).values
            for col in slide_cols:
                df_tmp_percent[col] = df_tmp_percent[col].values / (1e-5 + df_tmp_percent['sum_'].values) # share of the total
            df_tmp_percent = df_tmp_percent[slide_cols].values

            # City-level sums of the window columns (not used further in this function)
            df_grp = data.groupby(city_code)[slide_cols].sum().reset_index()
            df_city_dic = data.groupby(city_code)[slide_cols].sum().to_dict()

            df['district_' + fea_col + '_last{}_sum'.format(slide_windows)] = np.sum(df_tmp,axis=1)
            df['district_' + fea_col + '_last{}_median'.format(slide_windows)] = np.median(df_tmp,axis=1)
            df['district_' + fea_col + '_last{}_std'.format(slide_windows)] = np.std(df_tmp,axis=1)
            df['district_' + fea_col + '_last{}_min'.format(slide_windows)] = np.min(df_tmp,axis=1)
            df['district_' + fea_col + '_last{}_max'.format(slide_windows)] = np.max(df_tmp,axis=1)
            df['district_' + fea_col + '_last{}_mean_change'.format(slide_windows)] = data[slide_cols].apply(ts.mean_change,axis=1)

            df['district_percent_' + fea_col + '_last{}_median'.format(slide_windows)] = np.median(df_tmp_percent,axis=1)
            df['district_percent_' + fea_col + '_last{}_std'.format(slide_windows)] = np.std(df_tmp_percent,axis=1)
            df['district_percent_' + fea_col + '_last{}_min'.format(slide_windows)] = np.min(df_tmp_percent,axis=1)
            df['district_percent_' + fea_col + '_last{}_max'.format(slide_windows)] = np.max(df_tmp_percent,axis=1)
            # Skewness is computed on the raw window values, not on the shares
            df['district_percent_' + fea_col + '_last{}_skew'.format(slide_windows)] = skew(df_tmp,axis=1)

    # Value on the same weekday one week earlier
    df['flow_in_last_week'] = data['flow_in_before_7'].values
    df['flow_out_last_week'] = data['flow_out_before_7'].values
    df['dwell_last_week'] = data['dwell_before_7'].values

    # Same-weekday statistics over the last 1 to 3 weeks
    for fea_col in tqdm(['flow_in','flow_out','dwell']):
        for slide_windows in range(1,4):
            print(fea_col, slide_windows)
            slide_cols = [ fea_col + '_before_'+ str(i * 7 + 7) for i in range(slide_windows)]
            slide_cols.append(fea_col)
            df_tmp = data[slide_cols].values

            if slide_windows == 1:
                df[fea_col + '_w{}_mean'.format(slide_windows)] = np.mean(df_tmp,axis=1)
            else:
                df[fea_col + '_w{}_mean'.format(slide_windows)] = np.mean(df_tmp,axis=1)
                df[fea_col + '_w{}_median'.format(slide_windows)] = np.median(df_tmp,axis=1)
                df[fea_col + '_w{}_std'.format(slide_windows)] = np.std(df_tmp,axis=1)
                df[fea_col + '_w{}_mean_change'.format(slide_windows)] = data[slide_cols].apply(ts.mean_change,axis=1)

    return df
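The window statistics are computed row by row over a set of lag columns. Most of them (sum, median, min, max, std) do not depend on column order, but a mean of successive differences does. The sketch below illustrates this on a single toy row. The `mean_change` function here is a plain NumPy stand-in written for this example rather than the tsfresh calculator, and the values are made up.

```python
import numpy as np
import pandas as pd


def mean_change(x):
    """Mean of successive differences; equals (x[-1] - x[0]) / (len(x) - 1)."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(np.diff(x)))


# One row: today's value plus the three previous days
toy = pd.DataFrame({
    "dwell": [14.0],
    "dwell_before_1": [13.0],
    "dwell_before_2": [11.0],
    "dwell_before_3": [10.0],
})

# Order in which the window columns are listed by the feature function
as_listed = ["dwell_before_1", "dwell_before_2", "dwell_before_3", "dwell"]
# Oldest to newest
chronological = ["dwell_before_3", "dwell_before_2", "dwell_before_1", "dwell"]

window = toy[as_listed].values
window_sum = float(np.sum(window, axis=1)[0])
window_median = float(np.median(window, axis=1)[0])

change_listed = mean_change(toy.loc[0, as_listed].values)
change_chrono = mean_change(toy.loc[0, chronological].values)
print(window_sum, window_median, change_listed, change_chrono)
```

The two orderings give different mean-change values for the same window. If a chronological trend is what is wanted, the column list would need to be reordered before applying an order-dependent statistic.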

def _get_sts_features_n(data):
    #### 1.object_id:count ####
    grp_col = 'district_code'
    city_code = 'city_code'

    df = pd.DataFrame()
    df[grp_col] = data[grp_col].values
    df[city_code] = data[city_code].values
    df['ranks'] = data['ranks'].values

    ### Statistics over the N nearest days (for this model, the following days) ####
    for fea_col in tqdm(['flow_in','flow_out','dwell']):
        for slide_windows in [3,6,13,20,27]:
            print(fea_col, slide_windows)
            slide_cols = [ fea_col + '_after_'+ str(i+1) for i in range(slide_windows)]