简体   繁体   中英

Count groups in a column based on specific rules in R?

I'm trying to count groups in a column of data... but the count is based on some specific rules. In my column I have variable names (which I call values throughout this post) and NAs . Below are the rules and an example to illustrate what Im trying to do.

Rule 1: If 2 values appear beside each other (row-wise), then increase the count by 1.

Rule 2: If a selected value within the column of data has an NA and then a value in the 2 rows preceding it, then only increase the count by 1 if the total number of preceding values (not including NAs ) is odd.

I know that can sound confusing, so hopefully my example will explain more clearly.

For example, if I have a data frame that looks like this:

library(dplyr)
df <- tibble(name = c("A", NA, "B", "C", NA, NA, NA,
                          "A", NA, "B", NA, NA,
                          "A", "B", "C", NA, NA, NA, "D", NA, "E", NA, NA),
                 num = c(1,1,1,1,1,1,1,
                         2,2,2,2,2,
                         3,3,3,3,3,3,3,3,3,3,3))
>df
# A tibble: 23 × 2
   name    num
   <chr> <dbl>
 1 A         1
 2 NA        1
 3 B         1
 4 C         1
 5 NA        1
 6 NA        1
 7 NA        1
 8 A         2
 9 NA        2
10 B         2
11 NA        2
12 NA        2
13 A         3
14 B         3
15 C         3
16 NA        3
17 NA        3
18 NA        3
19 D         3
20 NA        3
21 E         3
22 NA        3
23 NA        3

So, if we just look at the rows where num == 1 (ie, df[1:7,] ) and we apply our rules, we can see that nothing happens for rows 1 & 2. But on row 3 we can apply RULE 2. Because the 2 rows preceding B are NA and A . And since the number of values preceding B is odd (ie, the only preceding value is A and we don't include NAs ), we increase the count by 1.

Next we can see in rows 3 and 4 that we have two values beside each other (ie, B, C ) so we increase the count by 1.

So, when num == 1 we would get a count of 2.

Just for clarity, applying the same rules when num == 3 . Here, we can apply RULE 1 twice because we have A, B and B, C , so we increase the count by 2. Then for rows 16:19, nothing would happen and the count does not increase. Additionally, when we get to rows 19:21, we can see that the value E has an NA and D preceding it. So we apply RULE 2 and see if we increase the count. If we count all the values in this group that are above E , we can see that there is an even amount (ie, A,B,C,D [excluding the NAs ), so we do not increase the count.

So, when num == 3 we would get a count of 2.

I hope that makes things clear... my desired output would be something like:

    num count
  <dbl> <dbl>
1     1     2
2     2     1
3     3     2

Any suggestions as to how I could do this?

Here is a dplyr solution.

library(dplyr)

should_increment <- function(x) {
  vals <- !is.na(x)
  l1_vals <- !is.na(lag(x))
  l2_vals <- !is.na(lag(x, 2L))
  vals & (l1_vals | (l2_vals & cumsum(vals) %% 2L == 0L))
}

df %>% group_by(num) %>% summarize(count = sum(should_increment(name)))

Output

# A tibble: 3 x 2
    num count
  <dbl> <int>
1     1     2
2     2     1
3     3     2

Simply speaking, the logic is (in pseudo-code)

if (current observation is not NA) {
  if (lag observation is not NA) {
    increase count by 1
  } else {
    if (lag 2 observation is not NA) {
      if (cumulative count of all non NA entries until the current one is even) {
        increase count by 1
      }
  }
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM