简体   繁体   中英

How do I create order statistics by group in R?

How do I compute order statistics by group in R. I want to aggregate results according to a column and then return only 1 row per group. That row should be the n-th element of the group according to some ordering. Ideally I would like to use base functions only.

x <- data.frame(Group=c("A","A", "A", "C", "C"), 
                Name=c("v", "u", "w", "x", "y"), 
                Quantity=c(3,3,4,2,0))
> x
  Group Name Quantity
1     A    v        3
2     A    u        3
3     A    w        4
4     C    x        2
5     C    y        0

I want to take nth highest based on an ordering on Quantity and then Name. For N=2 this is

  Group Name Quantity
1     A    u        3
5     C    y        0

For N=1
  Group Name Quantity
3     A    w        4
4     C    x        2

I tried the following but I get a uninformative error message.

 aggregate.data.frame(x, list(x$Group), function(y){ max(y[,'Quantity'])})
 Error in `[.default`(y, , "Quantity") (from #1) : incorrect number of dimensions"
x <- 
    data.frame(
        Group = c("A","A", "A", "C", "C", "A", "A") , 
        Name = c("v", "u", "w", "x", "y" ,"v", "u") , 
        Quantity = c(3,3,4,2,0,4,1)
    )

# sort your data to start..
# note that Quantity vs. Group and Name
# are sorted in different directions,
# so the -as.numeric() flips them
x <- 
    x[ 
        order( 
            -as.numeric( x$Group ) , 
            x$Quantity , 
            -as.numeric( x$Name ) , 
            decreasing = TRUE 
        ) , 
    ]
# once your data frame is sorted the way you want your Ns to occur, the rest is easy

# rank your data..  
# just create the numerical order, 
# but within each group..
# (or you could add those ranks directly to the data frame if you like)
ranks <- 
    unlist( 
        tapply( 
            order( x$Group ) , 
            as.numeric( x$Group ) , 
            order 
        ) 
    )

# N = 1
x[ ranks == 1 , ]

# N = 2
x[ ranks == 2 , ]

Some aggregate-merge magic:

f <- function(x, N) {
  sel <- function(x) {                                   # Choose the N-th highest value from the set, or lowest element if there < N unique elements.  Is there a built-in for this? 
    z <- unique(x)                                       # This assums that you wan the N-th highest unique value.  Simply don't filter by unique if not.
    z[order(z, decreasing=TRUE)][min(N, length(z))]
  }

  xNq <- aggregate(Quantity ~ Group, data=x,   sel)      # Choose the N-th highest quantity within each "Group"
  xNm <- merge(x, xNq)                                   # Add the matching "Name" values
  x <- aggregate(Name ~ Quantity + Group, data=xNm, sel) # Choose the N-th highest Name in each group
  x[c('Group', 'Name', 'Quantity')]                      # Put into original order
}


> f(x, 2)
##   Group Name Quantity
## 1     A    u        3
## 2     C    y        0

> f(x, 1)
##   Group Name Quantity
## 1     A    w        4
## 2     C    x        2
# define ordering function, increasing on Quantity, decreasing on Name
in.order <- function(group) with(group, group[order(Quantity, -rank(Name)), ])

# set desired rank for each Group
N <- 2

# get Nth row by Group, according to in.order
group.rows <- by(x, x$Group, function(group) head(tail(in.order(group), N), 1))

# collapse rows into data.frame
do.call(rbind, group.rows)

#   Group Name Quantity
# A     A    u        3
# C     C    y        0

The reason you're seeing that error with aggregate.data.frame is because this function applies FUN over each column, according to the by argument, not each subset of the full data.frame (that's what the by function is for, as you can see above). When using aggregate , whatever you supply to FUN should be accepting columns, not data.frame s. In your example, you're trying to index vector y like a data.frame , hence the dimension error.

I went with

do.call(rbind, by(x, x$Group, function(x)
      x[order(-x$Quantity, x$Name),][1,]))

per someone else's suggestion. I found it suited my thought process a bit better than the other posted solutions (which I appreciate).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM