
R data.table with variable number of columns

For each student in a data set, a certain set of scores may have been collected. We want to calculate the mean for each student, but using only the scores in the columns that were germane to that student.

The columns required in a calculation are different for each row. I've figured out how to write this in R using the usual tools, but I am trying to rewrite it with data.table, partly for fun, and partly in anticipation that success in this small project might lead to the need to make calculations for lots and lots of rows.

Here is a small working example of the "choose a specific column set for each row" problem.

set.seed(123234)
## Suppose these are 10 students in various grades
dat <- data.frame(id = 1:10, grade = rep(3:7, times = 2),
              A = sample(c(1:5, 9), 10,  replace = TRUE),
              B = sample(c(1:5, 9), 10, replace = TRUE),
              C = sample(c(1:5, 9), 10, replace = TRUE),
              D = sample(c(1:5, 9), 10, replace = TRUE))
## 9 is a marker for missing value, there might also be
## NAs in real data, and those are supposed to be regarded
## differently in some exercises

## Students in various grades are administered different
## tests.  A data structure gives the grade to test linkage.
## The letters are column names in dat
lookup <- list("3" = c("A", "B"),
           "4" = c("A", "C"),
           "5" = c("B", "C", "D"),
           "6" = c("A", "B", "C", "D"),
           "7" = c("C", "D"),
           "8" = c("C"))

## wrapper around that lookup because I kept getting confused
getLookup <- function(grade){
    lookup[[as.character(grade)]]
}


## Function that receives one row (named vector)
## from data frame and chooses columns and makes calculation
getMean <- function(arow, lookup){
    scores <- arow[getLookup(arow["grade"])]
    mean(scores[scores != 9], na.rm = TRUE)
}

stuscores <- apply(dat, 1, function(x) getMean(x, lookup))

result <- data.frame(dat, stuscores)
result

## If the data has thousands upon thousands of rows,
## I will wish I could use data.table to do that.

## Client will want students sorted by state, district, classroom,
## etc.

## However, am stumped on how to specify the adjustable
## column-name chooser

library(data.table)
DT <- data.table(dat)
## How to write call to getMean correctly?
## Want to do this for each participant (no grouping)
setkey(DT, id)

The desired output is the student average for the appropriate columns, like so:

> result
  id grade A B C D stuscores
1   1     3 9 9 1 4       NaN
2   2     4 5 4 1 5       3.0
3   3     5 1 3 5 9       4.0
4   4     6 5 2 4 5       4.0
5   5     7 9 1 1 3       2.0
6   6     3 3 3 4 3       3.0
7   7     4 9 2 9 2       NaN
8   8     5 3 9 2 9       2.0
9   9     6 2 3 2 5       3.0
10 10     7 3 2 4 1       2.5

Then what? I've made a lot of mistakes so far...

I did not find any examples in the data.table documentation in which the columns to be used in the calculation for each row were themselves variable. Thank you for your advice.

I am not asking anybody to write code for me; I'm asking for advice on how to get started with this problem.

First of all, when creating a reproducible example using functions such as sample (whose output depends on the state of the random number generator and so changes on each run), you should use set.seed.
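As a minimal illustration of that point (not part of the original posts), resetting the seed before each call makes sample return the same draw. The names first and second are made up for this sketch.

```r
## Same seed, same draw: the two vectors below are identical
set.seed(123)
first <- sample(c(1:5, 9), 10, replace = TRUE)

set.seed(123)
second <- sample(c(1:5, 9), 10, replace = TRUE)

identical(first, second)
```

The actual values drawn can differ between R versions, since the default sampling algorithm changed in R 3.6.0, but within one session the two draws match.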

Second of all, instead of looping over each row, you could just loop over the lookup list, which will always be smaller than the data (often significantly smaller), and combine it with rowMeans. You can also do it with base R, but you asked for a data.table solution, so here goes (for the purposes of this solution I've converted all 9s to NAs, but you can try to generalize this to your specific case too).
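The idea of looping over lookup rather than over rows can be sketched in base R roughly as follows. This is only an illustration under some assumptions: the data is typed in from the desired-output table in the question, the names res, rows and m are made up here, and the 9s are masked only in a temporary matrix so that the stored data keeps its 9 markers.

```r
## Data copied by hand from the desired-output table in the question
dat <- data.frame(
  id    = 1:10,
  grade = rep(3:7, times = 2),
  A = c(9, 5, 1, 5, 9, 3, 9, 3, 2, 3),
  B = c(9, 4, 3, 2, 1, 3, 2, 9, 3, 2),
  C = c(1, 1, 5, 4, 1, 4, 9, 2, 2, 4),
  D = c(4, 5, 9, 5, 3, 3, 2, 9, 5, 1)
)
lookup <- list("3" = c("A", "B"),
               "4" = c("A", "C"),
               "5" = c("B", "C", "D"),
               "6" = c("A", "B", "C", "D"),
               "7" = c("C", "D"),
               "8" = c("C"))

res <- rep(NA_real_, nrow(dat))
for (g in names(lookup)) {
  rows <- which(dat$grade == as.integer(g))
  if (length(rows) == 0L) next            # e.g. no grade 8 students here
  m <- as.matrix(dat[rows, lookup[[g]], drop = FALSE])
  m[which(m == 9)] <- NA                  # mask the 9 marker in the copy only
  res[rows] <- rowMeans(m, na.rm = TRUE)  # all-missing rows give NaN
}
res
```

The loop runs once per grade rather than once per student, so the per-row work is left to the vectorized rowMeans.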

So using set.seed(123), your function gives

apply(dat, 1, function(x) getMean(x, lookup))
# [1] 2.000000 5.000000 4.666667 4.500000 2.500000 1.000000 4.000000 2.333333 2.500000 1.500000

And here's a possible data.table application which runs only over the lookup list (for loops over lists are very efficient in R, by the way):

## convert all 9 values in the score columns to NAs
## (restricting this to the score columns leaves id 9 untouched)
scoreCols <- c("A", "B", "C", "D")
dat[scoreCols][dat[scoreCols] == 9L] <- NA
## convert your original data to `data.table` by reference;
## there is no need for an additional copy of the data if the data is huge
setDT(dat)
## loop only over the list
for(i in names(lookup)) {
  dat[grade == as.integer(i), res := rowMeans(as.matrix(.SD[, lookup[[i]], with = FALSE]), na.rm = TRUE)]
}
dat
#     id grade  A  B  C  D      res
#  1:  1     3  2 NA NA NA 2.000000
#  2:  2     4  5  3  5 NA 5.000000
#  3:  3     5  3  5  4  5 4.666667
#  4:  4     6 NA  4 NA  5 4.500000
#  5:  5     7 NA  1  4  1 2.500000
#  6:  6     3  1 NA  5  3 1.000000
#  7:  7     4  4  2  4  5 4.000000
#  8:  8     5 NA  1  4  2 2.333333
#  9:  9     6  4  2  2  2 2.500000
# 10: 10     7  3 NA  1  2 1.500000

Possibly, this could be improved utilizing set, but I can't think of a good way currently.
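One possible way to bring set() into the loop is sketched below. It is an untimed illustration, not a known improvement: the data is typed in from the desired-output table in the question, the names DT, scoreCols, rows and vals are made up here, and the res column is created up front so that set() only ever fills existing cells.

```r
library(data.table)

## Data copied by hand from the desired-output table in the question
DT <- data.table(
  id    = 1:10,
  grade = rep(3:7, times = 2),
  A = c(9, 5, 1, 5, 9, 3, 9, 3, 2, 3),
  B = c(9, 4, 3, 2, 1, 3, 2, 9, 3, 2),
  C = c(1, 1, 5, 4, 1, 4, 9, 2, 2, 4),
  D = c(4, 5, 9, 5, 3, 3, 2, 9, 5, 1)
)
lookup <- list("3" = c("A", "B"),
               "4" = c("A", "C"),
               "5" = c("B", "C", "D"),
               "6" = c("A", "B", "C", "D"),
               "7" = c("C", "D"),
               "8" = c("C"))

## Recode the 9 marker to NA in the score columns only (id is left alone)
scoreCols <- c("A", "B", "C", "D")
DT[, (scoreCols) := lapply(.SD, function(x) replace(x, which(x == 9), NA)),
   .SDcols = scoreCols]

## Create the result column first, then fill it grade by grade with set()
DT[, res := NA_real_]
for (g in names(lookup)) {
  rows <- which(DT$grade == as.integer(g))
  if (length(rows) == 0L) next            # no students in this grade
  vals <- rowMeans(as.matrix(DT[rows, lookup[[g]], with = FALSE]), na.rm = TRUE)
  set(DT, i = rows, j = "res", value = vals)
}
DT
```

Whether this beats the := version would need to be measured; set() mainly avoids the overhead of repeated [.data.table calls, which matters most when the loop has many iterations.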


P.S.

As suggested by @Arun, please take a look at the vignettes he wrote in order to get familiar with the := operator, .SD, with = FALSE, etc.

Here's another data.table approach using melt.data.table (needs data.table 1.9.5+) followed by joins between data.tables:

## reshape to long format; melt() dispatches to melt.data.table for a data.table
DT_m <- setkey(melt(DT, id.vars = c("id", "grade"), value.name = "score"), grade, variable)
## long-format version of the lookup list: one row per (grade, test) pair
lookup_dt <- data.table(grade = rep(as.integer(names(lookup)), lengths(lookup)),
  variable = unlist(lookup), key = "grade,variable")
## keep only the relevant (grade, test) rows, then average per student
score_summary <- setkey(DT_m[lookup_dt, nomatch = 0L,
  .(res = mean(score[score != 9], na.rm = TRUE)), by = id], id)
## join the per-student means back onto the original table
setkey(DT, id)[score_summary, mean_score := i.res]
DT
#    id grade A B C D mean_score
# 1:  1     3 9 9 1 4        NaN
# 2:  2     4 5 4 1 5        3.0
# 3:  3     5 1 3 5 9        4.0
# 4:  4     6 5 2 4 5        4.0
# 5:  5     7 9 1 1 3        2.0
# 6:  6     3 3 3 4 3        3.0
# 7:  7     4 9 2 9 2        NaN
# 8:  8     5 3 9 2 9        2.0
# 9:  9     6 2 3 2 5        3.0
#10: 10     7 3 2 4 1        2.5

It's more verbose, but just over twice as fast:

microbenchmark(da_method(), nk_method(), times = 1000)
#Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval
# da_method() 17.465893 17.845689 19.249615 18.079206 18.337346 181.76369  1000
# nk_method()  7.047405  7.282276  7.757005  7.489351  7.667614  20.30658  1000
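For readers who want to see the intermediate tables, the melt-and-join idea can also be written step by step with on= instead of keys. This is a sketch under assumptions rather than the code that was benchmarked: the data is typed in from the desired-output table in the question, and the names long, kept, means and the test column are made up here.

```r
library(data.table)

## Data copied by hand from the desired-output table in the question
DT <- data.table(
  id    = 1:10,
  grade = rep(3:7, times = 2),
  A = c(9, 5, 1, 5, 9, 3, 9, 3, 2, 3),
  B = c(9, 4, 3, 2, 1, 3, 2, 9, 3, 2),
  C = c(1, 1, 5, 4, 1, 4, 9, 2, 2, 4),
  D = c(4, 5, 9, 5, 3, 3, 2, 9, 5, 1)
)
lookup <- list("3" = c("A", "B"),
               "4" = c("A", "C"),
               "5" = c("B", "C", "D"),
               "6" = c("A", "B", "C", "D"),
               "7" = c("C", "D"),
               "8" = c("C"))

## 1. Long format: one row per (student, test) pair
long <- melt(DT, id.vars = c("id", "grade"),
             variable.name = "test", value.name = "score",
             variable.factor = FALSE)

## 2. The lookup list as a table: one row per (grade, test) pair
lookup_dt <- data.table(
  grade = rep(as.integer(names(lookup)), lengths(lookup)),
  test  = unlist(lookup, use.names = FALSE)
)

## 3. Inner join keeps only the tests that apply to each student's grade
kept <- long[lookup_dt, on = c("grade", "test"), nomatch = 0L]

## 4. Per-student mean, ignoring the 9 marker and any NAs
means <- kept[, .(res = mean(score[score != 9], na.rm = TRUE)), by = id]

## 5. Update join: write the means back onto the wide table by id
DT[means, on = "id", res := i.res]
DT
```

Because the join drops irrelevant (grade, test) pairs before any averaging happens, no per-row column selection is needed at all.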
