
Assigning Percentile Based Groups to Dataframe in R

I am having trouble figuring out how to take on this particular problem.

Suppose I have the following data frame:

set.seed(123)

Factors <- sample(LETTERS[1:26],50,replace=TRUE)
Values <- sample(c(5,10,15,20,25,30),50,replace=TRUE)
df <- data.frame(Factors,Values)
df

   Factors Values
1        H      5
2        U     15
3        K     25
4        W      5
5        Y     20
6        B     10
7        N      5
8        X     25
9        O     30
10       L     15
11       Y     20
12       L      5
13       R     15
(The data continues to row 50; the remaining rows are omitted here.)

Now suppose that I take the sum of Values by Factors

Sum.df <- aggregate(Values ~ Factors, data = df, FUN = sum)
Sum.df

   Factors Values
1        A      5
2        B     35
3        C     25
4        D     30
5        F     30
6        G     75
7        H     20
8        I     55
9        J     20
10       K     60
11       L     20
12       M     20
13       N      5
14       O     55
15       P     20
16       Q     25
17       R     45
18       S     30
19       T     30
20       U     40
21       W     25
22       X     90
23       Y     55
24       Z     15

Then finally I use quantile to find percentile cut offs for the aggregated data.

quantile(Sum.df$Values, probs = c(0.33,.66,1))

  33%   66%  100% 
22.95 35.90 90.00

Okay, so here's my question. What I want to do is create three groups, Group 1, Group 2 and Group 3, based on these quantiles. For example, in Sum.df the aggregated value for A is 5, so I want to assign that factor to Group 1 because 5 is less than 22.95. If the value in Sum.df is greater than 22.95 and less than or equal to 35.9, assign it to Group 2, and assign everything else to Group 3.

What I would love to see is a new column in df that denotes which group each of the Factors is in. I hope this makes sense. Thanks guys!

How about the cut function? You just need to include the minimum (probs = 0) in your quantiles, so that the breaks span the whole range of values.
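To see why the minimum matters, here is a minimal sketch. cut() treats its breaks as interval edges, so the three cutoffs from the question define only two intervals, and any value below the first cutoff comes back as NA. The four test values are sums taken from Sum.df above; the variable names are only illustrative.

```r
# A few aggregated sums from the question (A, C, U and X)
vals <- c(5, 25, 40, 90)

# Breaks without the minimum: only two intervals, and 5 falls outside them
without_min <- cut(vals, breaks = c(22.95, 35.9, 90))

# Breaks with the minimum (5) added: three intervals covering every value
with_min <- cut(vals, breaks = c(5, 22.95, 35.9, 90), include.lowest = TRUE)

as.integer(without_min)  # NA 1 2 2
as.integer(with_min)     # 1 2 3 3
```

include.lowest = TRUE is needed as well, because the intervals are open on the left by default and the minimum itself would otherwise be dropped.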

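# Include probs = 0 so the minimum becomes the lowest break
q <- quantile(Sum.df$Values, probs = c(0, 0.33, 0.66, 1))
# include.lowest = TRUE keeps the minimum value inside Group 1
Sum.df$group <- cut(Sum.df$Values, breaks = q, include.lowest = TRUE,
                    labels = paste("Group", 1:3))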
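The question asks for the group as a new column in df rather than in Sum.df. One way to carry it over, sketched here as an assumption rather than as part of the original answer, is to look up each row's factor in Sum.df with match(). To keep the example reproducible regardless of R version, it rebuilds Sum.df from the sums posted in the question and uses only the first 13 rows of df shown there.

```r
# Aggregated sums exactly as posted in the question
Sum.df <- data.frame(
  Factors = c("A", "B", "C", "D", "F", "G", "H", "I", "J", "K", "L", "M",
              "N", "O", "P", "Q", "R", "S", "T", "U", "W", "X", "Y", "Z"),
  Values  = c(5, 35, 25, 30, 30, 75, 20, 55, 20, 60, 20, 20,
              5, 55, 20, 25, 45, 30, 30, 40, 25, 90, 55, 15)
)

# Group the sums by quantile, as in the answer above
q <- quantile(Sum.df$Values, probs = c(0, 0.33, 0.66, 1))
Sum.df$group <- cut(Sum.df$Values, breaks = q, include.lowest = TRUE,
                    labels = paste("Group", 1:3))

# First 13 rows of df as displayed in the question
df <- data.frame(
  Factors = c("H", "U", "K", "W", "Y", "B", "N", "X", "O", "L", "Y", "L", "R"),
  Values  = c(5, 15, 25, 5, 20, 10, 5, 25, 30, 15, 20, 5, 15)
)

# For each row of df, find the matching factor in Sum.df and copy its group
df$group <- Sum.df$group[match(df$Factors, Sum.df$Factors)]
df
```

With the posted sums, each of the three groups contains eight factors. merge(df, Sum.df[, c("Factors", "group")], by = "Factors") would give the same column, but it reorders the rows by Factors, whereas match() keeps the original row order of df.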
