我需要在每个唯一标识符上运行一堆线性模型,但首先我需要做一个检查。对于每个唯一的ID和年份,我需要检查,至少有24个月以前的每月数据,但不得超过60个月。所以当我运行回归时,应该包括每个人每年的上一个月(年)数据的24-60次观察结果。如果该年度的资料少于24个月,则该年度的年份将会下降,但如果超过60岁,则仅使用60个月。
I need to run a bunch of linear models on each unique identifier, but first I need to do a check. For each unique id and year, I need to check that there are at least 24 months of previous monthly data, but not more than 60 months. So when I run the regression it should include between 24 - 60 observations of previous month (years) data for each year of each individual. If there are less than 24 months of data for that year, the year is dropped for that individual, but if there are more than 60, then only the 60 months are used.
感谢此 (感谢@akrun)的帖子,我能够为每个人设置线性模型,运行它们,然后输出beta作为两者的总和。问题在于,这只能在当年(12个obs)而不是以前的24-60进行回归。
Thanks to this (thanks @akrun) post, I was able to setup the linear models for each individual, run them, and then output the beta as the sum of both betas. The problem is that this only runs the regression on the current year (12 obs) and not the previous 24-60.
编辑:我意识到dput是错误的.sorry
Edit : I realized the dput was wrong...sorry
单个cusip dput:
Single cusip dput:
tdata <- structure(list(cusip = c(101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L), date = c(19901130L, 19901031L, 19900928L, 19900831L, 19900731L, 19900629L, 19900531L, 19900430L, 19900330L, 19900228L, 19900131L, 19891229L, 19891130L, 19891031L, 19890929L, 19890831L, 19890731L, 19890630L, 19890531L, 19890428L, 19890331L, 19890228L, 19890131L, 19881230L, 19881130L, 19881031L, 19880930L, 19880831L, 19880729L, 19880630L, 19880531L, 19880429L, 19880331L, 19880229L, 19880129L, 19871231L, 19871130L, 19871030L, 19870930L, 19870831L, 19870731L, 19870630L, 19870529L, 19870430L, 19870331L, 19870227L, 19870130L, 19861231L, 19861128L, 19861031L, 19860930L, 19860829L, 19860731L), fyear = c("1990", "1990", "1990", "1990", "1990", "1990", "1990", "1990", "1990", "1990", "1990", "1989", "1989", "1989", "1989", "1989", "1989", "1989", "1989", "1989", "1989", "1989", "1989", "1988", "1988", "1988", "1988", "1988", "1988", "1988", "1988", "1988", "1988", "1988", "1988", "1987", "1987", "1987", "1987", "1987", "1987", "1987", "1987", "1987", "1987", "1987", "1987", "1986", "1986", "1986", "1986", "1986", "1986"), month = c("11", "10", "09", "08", "07", "06", "05", "04", "03", "02", "01", "12", "11", "10", "09", "08", "07", "06", "05", "04", "03", "02", "01", "12", "11", "10", "09", "08", "07", "06", "05", "04", "03", "02", "01", "12", "11", "10", "09", "08", "07", "06", "05", "04", "03", "02", "01", "12", "11", "10", "09", "08", "07"), ret = c("0.117647", "0.030303", "-0.161017", "-0.186207", "-0.131737", "0.128378", "0.027778", "-0.162791", "0.131579", "0.178295", "-0.091549", "0.163934", "-0.089552", "0.007519", "0.117647", "0.155340", "0.211765", "0.024096", "0.338710", "0.377778", "0.071429", "-0.176471", "0.378378", "-0.026316", "-0.050000", "-0.047619", "-0.086957", "-0.061224", "0.088889", "-0.062500", "-0.040000", "-0.056604", "0.081633", "0.042553", "-0.096154", "0.238095", "-0.263158", "-0.393617", "-0.160714", "0.400000", "-0.090909", "-0.200000", "-0.098361", "-0.152778", "0.000000", "0.107692", "0.460674", "-0.101010", "-0.019802", "0.246914", "-0.052632", "0.179310", "-0.064516"), ewretd = c(0.035468, -0.057155, -0.080468, -0.108911, -0.025732, 0.005359, 0.045675, -0.028117, 0.021315, 0.015434, -0.046408, -0.012375, -0.0058, -0.049934, 0.005532, 0.018626, 0.031017, -0.007744, 0.025054, 0.029089, 0.01806, 0.002988, 0.062124, 0.018872, -0.036484, -0.011485, 0.016951, -0.025001, 0.000289, 0.047677, -0.017671, 0.014016, 0.03569, 0.060265, 0.077392, 0.026065, -0.05085, -0.272248, -0.015876, 0.014544, 0.035123, 0.021487, 0.000573, -0.017709, 0.036283, 0.074612, 0.117565, -0.034609, -0.006263, 0.023777, -0.059071, 0.023269, -0.073128), lagewretd = c(-0.004526, 0.035468, -0.057155, -0.080468, -0.108911, -0.025732, 0.005359, 0.045675, -0.028117, 0.021315, 0.015434, -0.046408, -0.012375, -0.0058, -0.049934, 0.005532, 0.018626, 0.031017, -0.007744, 0.025054, 0.029089, 0.01806, 0.002988, 0.062124, 0.018872, -0.036484, -0.011485, 0.016951, -0.025001, 0.000289, 0.047677, -0.017671, 0.014016, 0.03569, 0.060265, 0.077392, 0.026065, -0.05085, -0.272248, -0.015876, 0.014544, 0.035123, 0.021487, 0.000573, -0.017709, 0.036283, 0.074612, 0.117565, -0.034609, -0.006263, 0.023777, -0.059071, 0.023269)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -53L), .Names = c("cusip", "date", "fyear", "month", "ret", "ewretd", "lagewretd"))dplyr代码:
res1 <- tdata %>% group_by(cusip, fyear) %>% arrange(desc(date)) %>% mutate(n=n()) %>% do(data.frame(., beta=ifelse(.$n > 2, sum(coef(lm(ret~ewretd+lagewretd, data=.))[-1]), NA)))更新2:04/13/2015
Update 2 : 04/13/2015
这是一个 循环,我可以想到这将解决问题,但是再次,R中的循环的不是最有效的解决方案。
Here is a for loop that I could think of that would solve the problem, but again, for loops in R are not the most efficient solution.
for (i : unique(cusip)){ for (j : unique(fyear)){ check <- filter(tdata, fyear == i & fyear == i-1 & fyear == i-2 & fyear == i-3 & fyear == i-4) ifelse(length(check$month < 24), tdata$beta == NA, if(length(check$month >= 60)){ arrange(check, desc(date)), filter(check, month[1:60,]), check$beta <- sum(coef(lm(ret~ewretd+lagewretd, data = check))[-1])), left_join(tdata, check, by=c("cusip", fyear == j))}更新3:完整样本集
这包括所有obs,相当大(323mb)
This includes all obs, which is fairly large (323mb)
全部样本
推荐答案从长远来看,您可能想要使用正确的日期。通过将 fyear 从字符转换为整数,我已经朝着这个方向迈了一小步。
In the long run, you probably want to use proper dates. I've made a small step in that direction by converting fyear from character to integer.
library(dplyr) ## convert fyear to a proper number and then exploit for sorting tdata <- tdata %>% mutate(fyear = fyear %>% as.integer) %>% arrange(fyear, month)然后,我在 fyear 的级别上生成一个 tbl ,计算您将会收到多少个月的数据可用于安装模型。 (我拖动 cusip ,但由于您的数据只包含一个 cusip ,我无法确定这个都可以正常工作。)
Then I make a tbl summarized at the level of fyear, computing how many cumulative months of data you would have available for fitting a model. (I'm dragging cusip around but since your data only includes one cusip, I can't be sure that this is all working correctly.)
## figure out cumulative months available for each year (for each cusip) yearstuff <- tdata %>% group_by(cusip, fyear) %>% summarize(n = n()) %>% mutate(n_cum = cumsum(n)) yearstuff # Source: local data frame [5 x 4] # Groups: cusip # # cusip fyear n n_cum # 1 101 1986 6 6 # 2 101 1987 12 18 # 3 101 1988 12 30 # 4 101 1989 12 42 # 5 101 1990 11 53我没有发现模型适合作为 dplyr ,因为它不太适合 group_by 范例。相反,我使用 plyr :: ddply(),使用 yearstuff 开启东西,然后将每个 cusip * fyear 组合。如果没有足够的数据,如果数据太多,我只会采用最近的60个月。
I don't find the model fitting to be a very natural task for dplyr, since it doesn't fit nicely into the group_by paradigm. Instead I drive things off of yearstuff, using plyr::ddply(), and pull the data I need for each cusip * fyear combination. I decline to fit a model if there is not enough data and if there is too much data, I only take the most recent 60 months.
## iterate over rows of yearstuff (for each cusip) models <- plyr::ddply(yearstuff, ~ cusip + fyear, function(y) { if(y$n_cum < 24) { c('(Intercept)' = NA_real_, ewretd = NA_real_, lagewretd = NA_real_) } else { my_dat <- tdata %>% filter(cusip == y$cusip, fyear <= y$fyear) %>% mutate(rn = row_number(desc(date))) lm(ret ~ ewretd + lagewretd, my_dat, subset = rn < 61) %>% coef } }) models # cusip fyear (Intercept) ewretd lagewretd # 1 101 1986 NA NA NA # 2 101 1987 NA NA NA # 3 101 1988 -0.01138861 1.614342 0.14885911 # 4 101 1989 0.02467139 1.878295 0.00598857 # 5 101 1990 0.02529068 1.900389 0.05766020这将使您按照您的需要使用估计的系数。我认为这应该扩展到多个 cusip ,但谁知道?此数据集也不超过60个月。你应该明白地用手工对这些结果进行一些抽查!
This leaves you with the estimated coefficients to use as you wish. I think this should scale to multiple cusips but who knows?. Also this dataset doesn't contain more than 60 months. You should obviously do some spot checks of these results "by hand"!
更多推荐
dplyr和以前的观察
发布评论