Performing functions on grouped rows in an R dataframe

I have a large data frame in which multiple rows are repeated measurements for a single ID. For each individual, I want to return the row with the maximum value of a given column, essentially the equivalent of a GROUP BY in SQL.

Data frame (for illustrative purposes):

 ID lac pO2
 M1   1  80
 M1   4  80
 M2   2  70
 M2   3  70
 M3   3  75
 M3   5  75

I want to call max(lac) and return the following results:

 ID lac pO2
 M1   4  80
 M2   3  70
 M3   5  75

I've had a look around and thought that the by() function might be useful, but haven't had any luck (code below).

newdf <- by(df, df$ID, max(df$lac))

Error in FUN(X[[1L]], ...) : could not find function "FUN"

I also looked at tapply, but this doesn't work because I'm passing a data frame rather than a vector.

newdf <- tapply(df, df$ID, max)

Error: "arguments must have same length"

I've looked at similar answers, but these haven't helped. I'd appreciate some input from people more experienced than I am!

Edit

Having dug a little deeper, I've uncovered this question, which suggests the plyr package might be useful.

Try this:

> by(mtcars, mtcars$cyl, max)
mtcars$cyl: 4
[1] 146.7
--------------------------------------------------------------------------------------- 
mtcars$cyl: 6
[1] 258
--------------------------------------------------------------------------------------- 
mtcars$cyl: 8
[1] 472
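
Note that max applied to each subset takes the maximum over every column of that subset, so the call above returns one number per group rather than a row. To get whole rows for the data in the question, by() can be given a function that picks the row. A minimal sketch, assuming the question's example data is rebuilt as df (the names top_rows and result are only illustrative):

```r
# Example data from the question
df <- data.frame(
  ID  = c("M1", "M1", "M2", "M2", "M3", "M3"),
  lac = c(1, 4, 2, 3, 3, 5),
  pO2 = c(80, 80, 70, 70, 75, 75)
)

# by() needs a function as its third argument; here it returns
# the first row with the largest lac in each subset
top_rows <- by(df, df$ID, function(x) x[which.max(x$lac), ])

# Bind the per-group one-row data frames back together
result <- do.call(rbind, top_rows)
result
```

Passing max(df$lac) directly, as in the question, hands by() a number instead of a function, which is why that call fails.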

Alternatively, use plyr:

> require(plyr)
Loading required package: plyr
> ddply(mtcars, .(cyl), max)
  cyl    V1
1   4 146.7
2   6 258.0
3   8 472.0

For a large data set, try data.table (assuming df is your data set):

library(data.table)
setDT(df)[, .SD[which.max(lac)], by = ID]

##    ID lac pO2
## 1: M1   4  80
## 2: M2   3  70
## 3: M3   5  75

I found a solution using plyr, as discussed in the update.

The code used was:

max_lac <- ddply(.data = df, .variables = .(ID), function(x)
  x[which(x$lac == max(x$lac)), ])
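
The same per-group comparison can also be sketched in base R with ave(), which returns each group's maximum aligned to every row. This assumes the question's example data rebuilt as df; group_max and max_lac_base are illustrative names, not part of the original solution.

```r
# Example data from the question
df <- data.frame(
  ID  = c("M1", "M1", "M2", "M2", "M3", "M3"),
  lac = c(1, 4, 2, 3, 3, 5),
  pO2 = c(80, 80, 70, 70, 75, 75)
)

# ave() repeats each group's maximum lac alongside every row of that group
group_max <- ave(df$lac, df$ID, FUN = max)

# Keep the rows whose lac equals their group's maximum (ties are all kept)
max_lac_base <- df[df$lac == group_max, ]
max_lac_base
```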

Here's a dplyr alternative in case you're processing large data sets:

library(dplyr)

df %>% group_by(ID) %>% filter(lac == max(lac))

#Source: local data frame [3 x 3]
#Groups: ID
#
#  ID lac pO2
#1 M1   4  80
#2 M2   3  70
#3 M3   5  75

Note that if several rows in the same ID group share the maximum, this approach returns all rows containing the maximum value of lac, whereas approaches using which.max(.) return only the first row containing the maximum (per group).
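
To make the difference concrete, here is a small base R sketch on a single made-up group in which the maximum occurs twice (the vector lac below is invented for illustration and does not come from the question's data):

```r
# One hypothetical group where the maximum value 4 appears twice
lac <- c(1, 4, 4)

# Comparison with max() flags every position holding the maximum
all_max <- which(lac == max(lac))

# which.max() gives only the position of the first maximum
first_max <- which.max(lac)

all_max    # positions 2 and 3
first_max  # position 2
```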

If you only want to return the first maximum per group, you can use, for example:

df %>% group_by(ID) %>% filter(1:n() == which.max(lac))

or

df %>% group_by(ID) %>% filter(lac == max(lac)) %>% do(head(.,1))

Would this work (using ddply and which.max)?

ddply(df, .(ID), function(x) x[which.max(x$lac), ])
