
Using R summarise() with group_by(), while referencing other columns

I have a dataset (data1) with 4 columns, and I've been trying out a variety of summarise functions on the grouped data.

The columns are A (PERSON_ID), which is just a person ID, and B (LIST_ITEMS), which is a list of the object IDs that person has purchased (for example c("V5","32") or "45"). I kept them as characters because they're IDs regardless. Columns C (EXPENDITURE) and D (RATE) are two variables. C is how much the person spent in total, and when I use summarise, I just take the sum of C to aggregate. For D, however, I want something that references C: I want the value of D that corresponds to a quantile of C. (Each person has a different rate, and I want, let's say, the 50th percentile of that.) For instance, my code so far looks like:

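library(dplyr)
library(tidyr)  # for unnest()

data2 <- data1 %>%
  unnest(LIST_ITEMS) %>%
  group_by(PERSON_ID, EXPENDITURE, RATE) %>%
  # sort each person's items so identical sets of items compare equal
  summarise(LIST_ITEMS = list(sort(LIST_ITEMS)), .groups = 'drop') %>%
  group_by(LIST_ITEMS) %>%
  # pseudocode: the RATE expression is the part I don't know how to write
  summarise(EXPENDITURE = sum(EXPENDITURE), RATE = RATE[Nth percentile of EXPENDITURE])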

Now, this could maybe be done by sorting EXPENDITURE (or column C, for ease), taking the cumulative sum, and then picking the value that corresponds to where that sum reaches 50% of the total, but that feels like a convoluted way to do it, and these are discrete values. Let's say that, after group_by, the grouped data for one value of column B looks like this:

structure(list(A = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L), 
B = list(c("45","33"), c("45","33"), c("45","33"), c("45","33"), c("45","33"), c("45","33"), c("45","33"), c("45","33"), c("45","33")), 
C = c(600L, 200L, 500L, 200L, 300L, 400L, 300L, 400L, 100L), 
D = c(40L, 20L, 100L, 40L, 30L, 80L, 60L, 50L, 100L)), 
.Names = c("A", "B", "C", "D"), 
row.names = c(NA, 9L), class = "data.frame")

(I couldn't put it in as a table because Stack Overflow gave me an error saying it detected incorrectly formatted code, so the entire table needed triple backquotes around it before the website would let me post the question.)

Now, let's say the nth percentile I want is the 50th. I'd basically want to sort column D in ascending order (since the rate starts at the lowest and goes up) and then take the cumulative sum of column C in that sorted order (the sum of column C is 3000). I'd then take the value of D at the point where the cumulative sum hits 50% of 3000, i.e. 1500.

Now, in the sorted order, I get 200+300+600+200=1300. The next row in the sorted list is A = 8, B = c("45","33"), C = 400, D = 50, which brings the cumulative sum to 1700. That crosses the 50% mark, so I'd want my function to return the value 40, as it is the closest value in the floor direction.
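
The arithmetic above can be checked in base R. This is only a minimal sketch that uses the C and D columns from the sample data; the names ord, running and target are made up for illustration.

```r
# C and D columns copied from the sample data
C <- c(600L, 200L, 500L, 200L, 300L, 400L, 300L, 400L, 100L)
D <- c(40L, 20L, 100L, 40L, 30L, 80L, 60L, 50L, 100L)

ord     <- order(D)        # row order for an ascending sort by D
running <- cumsum(C[ord])  # running total of C in that order
target  <- 0.5 * sum(C)    # 50% of the total of C

# One row per person, in ascending order of D
print(data.frame(D = D[ord], C = C[ord], running = running))
print(target)
```

The running column comes out as 200, 500, 1100, 1300, 1700, ... and target is 1500, so the last row that stays at or below the target is the fourth one, where D is 40.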

How would I design such a function? The sample output for the example I've given is:

B C D
c(45,33) 3000 40

Is there an easy way to perform such an operation?

You can use findInterval for this:

library(dplyr)

perc <- 0.5

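df %>%
  arrange(B, D) %>%   # sort by B, then by D ascending
  group_by(B) %>%
  summarise(
    # index of the last row whose running total of C is <= perc * sum(C)
    val = findInterval(sum(C) * perc, cumsum(C)), 
    C = sum(C), 
    D = D[val]) %>%
  select(-val)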

#     B         C     D
#  <list>    <int> <int>
#1 <chr [2]>  3000    40
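
If the same lookup is needed in several places, the findInterval step could be wrapped in a small function. The sketch below uses base R only; rate_at_spend_quantile is a hypothetical name, not something from dplyr or from the code above. It assumes the spend values are non-negative (so the cumulative sum is non-decreasing, which findInterval requires), and it returns NA when even the first row already exceeds the target, a case in which D[val] would be indexed with 0.

```r
# Hypothetical helper: value of `rate` at the last row whose running total
# of `spend`, taken in ascending `rate` order, is still <= perc * sum(spend).
# Assumes spend >= 0 so that cumsum(spend) is non-decreasing.
rate_at_spend_quantile <- function(spend, rate, perc = 0.5) {
  ord <- order(rate)
  idx <- findInterval(sum(spend) * perc, cumsum(spend[ord]))
  # idx is 0 when the first running total is already above the target;
  # return an NA of the same type as `rate` in that case
  if (idx == 0L) return(rate[NA_integer_])
  rate[ord][idx]
}

# Sample data from the question
C <- c(600L, 200L, 500L, 200L, 300L, 400L, 300L, 400L, 100L)
D <- c(40L, 20L, 100L, 40L, 30L, 80L, 60L, 50L, 100L)

print(rate_at_spend_quantile(C, D, perc = 0.5))   # 40
print(rate_at_spend_quantile(C, D, perc = 0.05))  # NA
```

Inside summarise() it would have to be called before C is overwritten by its sum, for example summarise(D = rate_at_spend_quantile(C, D, perc), C = sum(C)), because later expressions in summarise() see the new value of C.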


 