
Grouping data into ranges in R

Suppose I have a data frame in R that has names of students in one column and their marks in another column. These marks range from 20 to 100.

> mydata  
id  name   marks gender  
1   a1    56     female  
2   a2    37      male  

I want to divide the students into groups based on their marks, so that the marks within each group differ by no more than 10. I tried the function table, which gives the number of students in each range (say 20-30, 30-40), but I want it to pick the students whose marks fall in a given range and put all their information together in a group. Any help is appreciated.

I am not sure what you mean by "put all their information together in a group", but here is a way to split your original data frame into a list of data frames, where each element holds the students within a mark range of 10:

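mydata <- data.frame(
  id = 1:100,
  name = paste0("a", 1:100),
  marks = sample(20:100, 100, TRUE),
  gender = sample(c("female", "male"), 100, TRUE))

# include.lowest = TRUE keeps marks of exactly 20 in the first interval, [20,30];
# without it they fall outside (20,30], become NA, and are dropped by split()
split(mydata, cut(mydata$marks, seq(20, 100, by = 10), include.lowest = TRUE))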
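If the goal is to pull out one range by name, or to keep a single table with a group label, a minimal sketch along the same lines might look like this. The `range` column and the `groups` object are names introduced here for illustration, and `right = FALSE` is an assumption on my part so that a mark of 30 lands in 30-40 rather than 20-30.

```r
set.seed(42)  # only so the illustration is reproducible
mydata <- data.frame(
  id = 1:100,
  name = paste0("a", 1:100),
  marks = sample(20:100, 100, TRUE),
  gender = sample(c("female", "male"), 100, TRUE))

# Label every row with its range: [20,30), [30,40), ..., [90,100]
# right = FALSE closes intervals on the left; include.lowest = TRUE
# then closes the last interval on the right so that 100 is kept
mydata$range <- cut(mydata$marks, breaks = seq(20, 100, by = 10),
                    right = FALSE, include.lowest = TRUE)

# One data frame per range, with all columns of the original rows
groups <- split(mydata, mydata$range)

# All information for the students who scored from 30 up to (not including) 40
groups[["[30,40)"]]
```

Keeping the label as a column means the data can also stay in one data frame and be filtered or summarised by `range` later.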

I think that @Sacha's answer should suffice for what you need to do, even if you have more than one data set.

You haven't explicitly said how you want to "group" the data in your original post, and in your comment, where you've added a second dataset, you haven't said whether you plan to "merge" the two first (rbind would suffice, as recommended in the comment).

So, with that, here are several options, each with different levels of detail or utility in the output. Hopefully one of them suits your needs.

First, here's some sample data.

# Two data.frames (myData1, and myData2)
set.seed(1)
myData1 <- data.frame(id = 1:20, 
                      name = paste("a", 1:20, sep = ""),
                      marks = sample(20:100, 20, replace = TRUE),
                      gender = sample(c("F", "M"), 20, replace = TRUE))
myData2 <- data.frame(id = 1:17,
                      name = paste("b", 1:17, sep = ""),
                      marks = sample(30:100, 17, replace = TRUE),
                      gender = sample(c("F", "M"), 17, replace = TRUE))

Second, different options for "grouping".

  • Option 1: Return (in a list) the rows from myData1 and myData2 which match a given condition. For this example, you'll end up with a list of two data.frames.

     lapply(list(myData1 = myData1, myData2 = myData2), function(x) x[x$marks >= 30 & x$marks <= 50, ]) 
  • Option 2: Return (in a list) each dataset split into two, one part for FALSE (doesn't match the stated condition) and one for TRUE (does match the stated condition). In other words, this creates four groups in total. For this example, you'll end up with a nested list with two list items, each with two data.frames.

     lapply(list(myData1 = myData1, myData2 = myData2), function(x) split(x, x$marks >= 30 & x$marks <= 50)) 
  • Option 3: More flexible than the first. This is essentially @Sacha's example extended to a list. You can set your breaks wherever you would like, making this, in my mind, a really convenient option. For this example, you'll end up with a nested list with two list items, each with multiple data.frames.

     lapply(list(myData1 = myData1, myData2 = myData2), function(x) split(x, cut(x$marks, breaks = c(0, 30, 50, 75, 100), include.lowest = TRUE))) 
  • Option 4: Combine the data first and use the grouping method described in Option 1. For this example, you will end up with a single data.frame containing only the rows which match the given condition.

     # Combine the data. Assumes the column names are the same in both sets
     myDataALL <- rbind(myData1, myData2)
     # Extract just the group of scores you're interested in
     myDataALL[myDataALL$marks >= 30 & myDataALL$marks <= 50, ]
  • Option 5: Using the combined data, split the data into two groups: one group which matches the stated condition, one which doesn't. For this example, you will end up with a list with two data.frames.

     split(myDataALL, myDataALL$marks >= 30 & myDataALL$marks <= 50) 
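One thing the combined options lose is which dataset each row came from. A rough sketch of one way to keep that, assuming you are happy to add two helper columns (`source` and `range` are names made up here, not part of the original data):

```r
set.seed(1)
myData1 <- data.frame(id = 1:20,
                      name = paste("a", 1:20, sep = ""),
                      marks = sample(20:100, 20, replace = TRUE),
                      gender = sample(c("F", "M"), 20, replace = TRUE))
myData2 <- data.frame(id = 1:17,
                      name = paste("b", 1:17, sep = ""),
                      marks = sample(30:100, 17, replace = TRUE),
                      gender = sample(c("F", "M"), 17, replace = TRUE))

# Tag each row with the dataset it came from, then stack the two sets
myDataALL <- rbind(transform(myData1, source = "myData1"),
                   transform(myData2, source = "myData2"))

# Same breaks as in Option 3, stored as a column instead of used on the fly
myDataALL$range <- cut(myDataALL$marks, breaks = c(0, 30, 50, 75, 100),
                       include.lowest = TRUE)

# Number of students per dataset and mark range
table(myDataALL$source, myDataALL$range)

# One data.frame per (dataset, range) combination; drop = TRUE omits empty ones
bySourceAndRange <- split(myDataALL, list(myDataALL$source, myDataALL$range),
                          drop = TRUE)
```

This gives the same kind of grouping as Option 3, but as a flat list built from a single data.frame.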

I hope one of these options serves your needs!

I had the same kind of issue, and after researching some answers on Stack Overflow I came up with the following solution:

Step 1: Define the range
Step 2: Find the elements that fall in the range
Step 3: Plot

Sample code is shown below:

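   # `all` is the data frame being analysed; `downlink` is a numeric column in it
   # Step 1: define the range boundaries in steps of 2000, extending one step
   # past the maximum so that the largest values are not dropped
   breaks <- seq(0, max(all$downlink) + 2000, by = 2000)
   # Step 2: count the elements that fall in each half-open range [lower, upper)
   counts <- numeric(length(breaks) - 1)
   for (i in seq_along(counts)) {
     counts[i] <- sum(all$downlink >= breaks[i] & all$downlink < breaks[i + 1])
   }
   # Step 3: plot, rounding the y-axis limit up to the next multiple of 1000
   a <- ceiling(max(counts) / 1000) * 1000
   barplot(counts, col = rainbow(16), ylim = c(0, a))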
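The same three steps can probably be done without an explicit loop by letting cut and table do the counting. This is only a sketch: the data below are simulated, and `all_data` is a stand-in name for the real data frame.

```r
set.seed(123)
# Hypothetical data: 500 non-negative downlink values
all_data <- data.frame(downlink = runif(500, min = 0, max = 9500))

# Step 1: range boundaries every 2000, reaching past the maximum value
breaks <- seq(0, max(all_data$downlink) + 2000, by = 2000)

# Step 2: assign each value to a half-open range [lower, upper) and count;
# dig.lab = 5 keeps labels such as 10000 out of scientific notation
bins <- cut(all_data$downlink, breaks = breaks, right = FALSE, dig.lab = 5)
counts <- table(bins)

# Step 3: plot, with the range labels on the x-axis
barplot(counts, col = rainbow(length(counts)), las = 2,
        ylim = c(0, max(counts) * 1.1))
```

Because `counts` is a table, the bars are labelled with their ranges automatically.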
