使用lapply在R中由两列分组的循环回归模型

Question

问题：

我有一个看起来像这样的数据框：

YEAR    Region    Illness_Code    Illness_description    COUNT
2014    A         ABC             test                   222
2015    A         ABC             test                   122
2016    A         ABC             test                   111
2014    B         XYZ             testttt                333
2015    B         XYZ             testttt                3232
2016    B         XYZ             testttt                123
2014    C         ABC             test                   333
2015    C         ABC             test                   123
2016    C         ABC             test                   123
.....

我只能得到每种distinct illnesses的系数，而不能得到每个region的系数。

下面是使用的代码：

# Get only illnesses which occurs every year
df <- df %>% 
    group_by(Illness_Code) %>% 
    filter(n() == 3)
# To dataframe
df <- data.frame(df)

# Loop through the dataframe and apply model
out <- lapply(
              unique(df$Illness_Code),
              function(c){
                sub_cases <- subset(df, Illness_Code == c)
                m <- lm(formula = COUNT ~ YEAR, data = sub_cases)
                coef(m)
              })
# Format the data
out <- do.call(rbind, out)

# Make it a dataframe
out <- data.frame(out)

结果出来像这样：

  X.Intercept.    YEAR
1     37254.05 -787.33
2     30745.21 3005.84
3      6992.99 2480.82
4      8391.65 3521.96
5     19298.03 -345.88
6     15163.82 -438.50

我想要得到的是每个region每种distinct illnesses的coefficients 。

题：

如何按distinct illnesses和region其分组？

因此结果应为：

Region    Illness_Code  Illness_description Intercept Slope  COUNT_2016
A         ABC           test                  222.123    15  111
A         XYZ           testttt               122.222 121.1  222
B         ABC           test                  ...     ...    ...
B         XYZ           testttt                            
C         ABC           test                   
C         XYZ           testttt                                   
.....

Answer 1

library(dplyr)
library(tidyr) #nest
library(broom) #tidy
library(purrr) #map

df %>% group_by(Region,Illness_Code) %>% nest() %>% 
      mutate(fit=map(data, ~lm(COUNT~YEAR, data = .)), results = map(fit, tidy)) %>%
      unnest(results)

# A tibble: 6 x 7
Region Illness_Code term         estimate std.error statistic p.value
<fct>  <fct>        <chr>           <dbl>     <dbl>     <dbl>   <dbl>
1 A      ABC          (Intercept)  111984.    51770.     2.16     0.276
2 A      ABC          YEAR            -55.5      25.7   -2.16     0.276
3 B      XYZ          (Intercept)  212804.  3494736.     0.0609   0.961
4 B      XYZ          YEAR           -105.     1734.    -0.0605   0.962
5 C      ABC          (Intercept)  211768.   122153.     1.73     0.333
6 C      ABC          YEAR           -105.       60.6   -1.73     0.333

使用lapply和split

#Identify list elements with nrow greater than one
Ind <- sapply(split(df1, list(df1$Region,df1$Illness_Code)), function(x)nrow(x)>1) 

lapply(
  #Loop only throught list elements wiht nrow>1
  split(df, list(df$Region,df$Illness_Code))[Ind],
  function(x){
    #browser()
    m <- lm(formula = COUNT ~ YEAR, data = x)
    #coef(m)
    as.data.frame(cbind(t(coef(m)), 'Year_2016'=x[x$YEAR==2016,'COUNT']))
  })

默认情况下split(df1, list(df1$Region,df1$Illness_Code))将生成一个列表，其中包含Region和Illness_Code级别之间的所有交互，但是其中一些交互的nrow = 0，例如$B.ABC和$A.XYZ这会在以后引起问题，因此我们需要使用指示器将其移除

使用lapply在R中由两列分组的循环回归模型

问题描述

问题：

题：

1 个解决方案

解决方案1
2 已采纳 2018-12-05 08:05:55

使用lapply在R中由两列分组的循环回归模型

问题描述

问题：

题：

1 个解决方案

解决方案1 2 已采纳 2018-12-05 08:05:55

解决方案1
2 已采纳 2018-12-05 08:05:55