R data.table: select the maximum value of columns by group

Question

I have seen multiple posts that address my requirement, but somehow I am not able to get the needed result.

I have a data.table with multiple columns. Out of all the columns I want to select a few, take their maximum value, and summarize them by the group variable.

Below is my sample data:

library("data.table")
set.seed(1200)

ID <- seq(1001,1100)
region <- sample(1:10,100,replace = T)
Q21 <- sample(1:5,100,replace = T)
Q22 <- sample(1:15,100,replace = T)
Q24_LOC_1 <- sample(1:8,100,replace = T)
Q24_LOC_2 <- sample(1:8,100,replace = T)
Q24_LOC_3 <- sample(1:8,100,replace = T)
Q24_LOC_4 <- sample(1:8,100,replace = T)

Q21_PAN <- sample(1:5,100,replace = T)
Q22_PAN <- sample(1:15,100,replace = T)
Q24_LOC_1_PAN <- sample(1:8,100,replace = T)
Q24_LOC_2_PAN <- sample(1:8,100,replace = T)
Q24_LOC_3_PAN <- sample(1:8,100,replace = T)
Q24_LOC_4_PAN <- sample(1:8,100,replace = T)

df1 <- as.data.table(data.frame(ID,region,Q21,Q22,Q24_LOC_1,Q24_LOC_2,Q24_LOC_3,Q24_LOC_4,Q21_PAN,Q22_PAN,Q24_LOC_1_PAN,Q24_LOC_2_PAN,Q24_LOC_3_PAN,Q24_LOC_4_PAN))

Now for the above data I want to get the maximum value of 4 columns by region. The result should have the ID variable, the region variable and these 4 variables, with 10 rows: 1 row for each region. I tried the code below, but it creates a column named mycol and fills it with the 4th element of the vector mycol.

mycol <- paste("Q24", "LOC", 1:4, "PAN", sep = "_")

df2 <- df1[,.(mycol = max(mycol)),by=region]

Please suggest where I am going wrong and how I can achieve this.

Answer 1

To get the max by group, group by 'region', specify 'mycol' in .SDcols, then loop over the Subset of Data.table (.SD) with lapply and take the max of each column:

df1[, lapply(.SD, max), by = region, .SDcols = mycol]
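As for where the original attempt goes wrong: inside `j`, `mycol` is not a column of the table, so it resolves to the character vector of names. `max(mycol)` then returns the last name in sort order, and `mycol =` is taken as a literal column name. A minimal sketch on a toy table (`toy`, `cols` and the values are made up for illustration) shows the difference:

```r
library(data.table)

# Toy data, made up for illustration
toy <- data.table(region = c(1L, 1L, 2L, 2L),
                  a = c(3L, 7L, 2L, 5L),
                  b = c(9L, 4L, 6L, 8L))
cols <- c("a", "b")

# max() on the character vector itself: the last name in sort order, "b"
max(cols)

# Per-group maximum of each column named in cols
res <- toy[, lapply(.SD, max), by = region, .SDcols = cols]
res
```

Here `res` has one row per region, with 7 and 9 for region 1 and 5 and 8 for region 2.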

If some 'region' has only NA values in a column, max with na.rm = TRUE has nothing left to compare, so it returns -Inf and issues a warning. For example,

max(c(NA, NA), na.rm = TRUE)
#[1] -Inf

Warning message:
In max(c(NA, NA), na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf
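The warning only appears when na.rm = TRUE is set. A short sketch of both cases in base R (the vector `x` is made up for illustration):

```r
x <- c(NA_integer_, NA_integer_)

# Without na.rm the NA simply propagates and there is no warning
plain <- max(x)
plain

# With na.rm = TRUE nothing is left to compare, so max() warns and returns -Inf
stripped <- suppressWarnings(max(x, na.rm = TRUE))
stripped
```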

To correct this, we could add an if/else condition that returns NA for such groups:

df1[, lapply(.SD, function(x) if(all(is.na(x))) NA_integer_
       else max(x, na.rm = TRUE)), by = region, .SDcols = mycol]
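To see the guard at work, here is a small sketch on a toy table in which one region holds only NA values. The table `toy` and the helper name `safe_max` are hypothetical, introduced only for this example:

```r
library(data.table)

# Toy data, made up for illustration: region 2 has only NA in column a
toy <- data.table(region = c(1L, 1L, 2L, 2L),
                  a = c(3L, NA, NA, NA))

# Hypothetical helper: NA for an all-NA group, otherwise the max of the rest
safe_max <- function(x) {
  if (all(is.na(x))) NA_integer_ else max(x, na.rm = TRUE)
}

res <- toy[, lapply(.SD, safe_max), by = region, .SDcols = "a"]
res
```

Region 1 gives 3 and region 2 gives NA, with no warning. NA_integer_ matches the integer columns used here; for double columns NA_real_ would be the matching type.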

If we also need 'ID' as a pasted string, i.e. all IDs of a region collapsed into one comma-separated value:

df1[, c(list(ID = toString(ID)), lapply(.SD, max)), by = region, .SDcols = mycol]
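A sketch of what this produces, again on a made-up toy table (`toy` and its values are assumptions for illustration):

```r
library(data.table)

# Toy data, made up for illustration
toy <- data.table(ID = 1001:1004,
                  region = c(1L, 1L, 2L, 2L),
                  a = c(3L, 7L, 2L, 5L))

# ID is collapsed to one string per region; a is reduced to its group maximum
res <- toy[, c(list(ID = toString(ID)), lapply(.SD, max)),
           by = region, .SDcols = "a"]
res
```

The ID column becomes character, holding "1001, 1002" for region 1 and "1003, 1004" for region 2, so the result still has exactly one row per region.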

Answer 1: 3, accepted, 2018-04-20 05:57:38
