简体   繁体   English

R-一个数据帧中的几个线性回归与因子和NA

[英]R- Several linear regressions in one dataframe with factors and NA's

I'm very new to R and I have to work with a dataset of more than 100 columns, simplified below: 我是R的新手,我必须处理超过100列的数据集,简化如下:

Station time data1         data2        data3         data4.....
1       0.0  35.02430310   44.2229390   NA
1       0.8  -68.75294241  -85.5847503  NA
1       1.8  -43.10200333  -62.8035400  NA
3       0.0  0.02217693    0.1336396    0.03203031
3       0.9  7.84203118    -6.4854953   6.22910506
3       2.2  -0.41682970   -7.7022785   0.92807170
17      0.0  4.24864888    4.2104517    0.00000000
17      0.9  1.79933934    -6.6360999   -10.10756894
17      2.1  1.99226283    2.2676248    -13.15887674

With every data column I would like to do a linear regression with time , but I need the coefficients for every Station (which are factors). 对于每个data列,我想对time进行线性回归,但我需要每个站的系数(这些是因子)。 From the plyr package I used 从我使用的plyr包装

ddply(dataframe, .(Station), function(z) coef(lm(data1 ~ time, data=z))) 

for example for data1 : 例如对于data1

 Station (Intercept)        t.h.
1  1    9.674588 -40.5399850
2 37    3.130705  -0.6284611
3 48    3.657316  -0.9474062

This would be the way I need the coefficients, but for every data column. 这将是我需要系数的方式,但对于每个data列。 Now, even if I would use this code for every single data column, I get problems with the columns that have NA values. 现在,即使我将此代码用于每个data列,我也会遇到具有NA值的列的问题。 I would like to simply drop these stations, but only for the specific column (in this case only for data3 . For data1 and data2 I would like to keep Station 1. 我想简单地删除这些站,但仅限于特定列(在这种情况下仅用于data3 。对于data1data2我想保留Station 1。

Is there a solution for this? 这有解决方案吗? Any suggestion would be appreciated. 任何建议将不胜感激。

data dput : 数据dput

structure(list(Station = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 
3L, 3L, 3L), .Label = c("1", "3", "17"), class = "factor"), time = c(0, 
0.8, 1.8, 0, 0.9, 2.2, 0, 0.9, 2.2), data1 = c(35.0243031, -68.75294241, 
-43.10200333, 0.02217693, 7.84203118, -0.4168297, 4.24864888, 
1.79933934, 1.99226283), data2 = c(44.222939, -85.5847503, -62.80354, 
0.1336396, -6.4854953, -7.7022785, 4.2104517, -6.6360999, 2.2676248
), data3 = c(NA, NA, NA, 0.1410939, 30.0332505, 11.449285, 0.1161954, 
-2.061781, 0.2289149)), .Names = c("Station", "time", "data1", 
"data2", "data3"), row.names = c(NA, -9L), class = "data.frame")

We need to reshape your data.frame to long format first, then omit NA values, and consequently apply the model per unique key ( 'data' and Station ), and finally tidy up the output from the lm() call. 我们需要data.frame重新data.frame为长格式,然后省略NA值,然后根据唯一键( 'data'Station )应用模型,最后整理lm()调用的输出。

library(tidyr)
library(broom)

df %>% gather(data, value, -c(Station, time)) %>%
  na.omit() %>%
  group_by(data, Station) %>% 
  do(tidy(coef(lm(value ~ time, data = .)))) %>%
  spread(names, x) 

#   data Station `(Intercept)`        time
#* <chr>  <fctr>         <dbl>       <dbl>
#1 data1       1     9.5534021 -40.5734035
#2 data1       3     3.1391280  -0.6354857
#3 data1      17     3.6539549  -0.9424560
#4 data2       1    13.8883780 -56.0886482
#5 data2       3    -1.1964287  -3.3757574
#6 data2      17     0.2938263  -0.3353234
#7 data3       3     9.9859146   3.7631889
#8 data3      17    -0.7504115   0.1724399

The example data used is what you shared up to column data3 . 使用的示例数据是您共享的最新动态栏data3

Using complete.cases will result in Station 1 being dropped completely, is this what you want 使用complete.cases将导致Station 1完全掉线,这就是你想要的

DF=read.table(text="Station time data1         data2        data3         
1       0.0  35.02430310   44.2229390   NA
1       0.8  -68.75294241  -85.5847503  NA
1       1.8  -43.10200333  -62.8035400  NA
3       0.0  0.02217693    0.1336396    0.03203031
3       0.9  7.84203118    -6.4854953   6.22910506
3       2.2  -0.41682970   -7.7022785   0.92807170
17      0.0  4.24864888    4.2104517    0.00000000
17      0.9  1.79933934    -6.6360999   -10.10756894
17      2.1  1.99226283    2.2676248    -13.15887674",header=TRUE,stringsAsFactors=FALSE)




ddply(DF, .(Station), function(z) z[complete.cases(z),]) 
#  Station time       data1      data2        data3
#1       3  0.0  0.02217693  0.1336396   0.03203031
#2       3  0.9  7.84203118 -6.4854953   6.22910506
#3       3  2.2 -0.41682970 -7.7022785   0.92807170
#4      17  0.0  4.24864888  4.2104517   0.00000000
#5      17  0.9  1.79933934 -6.6360999 -10.10756894
#6      17  2.1  1.99226283  2.2676248 -13.15887674

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM