简体   繁体   中英

How to count the number of rows a condition has been met

I have a dataframe df of stock liquidity ( dput statement below), and I need to measure how many of the previous consecutive three rows the value has been > 10:

  date     sec_id  liquidity   good       count.good.rows
2016-07-29   3277  9.142245 FALSE               0
2016-08-31   3277 11.070555  TRUE               0
2016-09-30   3277 11.934113  TRUE               1
2016-10-31   3277 12.192237  TRUE               2
2016-11-30   3277 10.165183  TRUE               3
2016-12-30   3277  8.414033 FALSE               3
2016-01-29   3426  6.494181 FALSE               0
2016-02-29   3426  8.216213 FALSE               0
2016-03-31   3426 10.081115  TRUE               0
2016-04-29   3426 10.119685  TRUE               1
2016-05-31   3426  8.659732 FALSE               2
2016-06-30   3426  6.790178 FALSE               1
2016-07-29   3426  7.234159 FALSE               0

Note a few things about the data:

  1. There are multiple sec_id values, and I need to do this work on each sec_id value, based on the order of the data column.
  2. I've already added the good column, but can't figure out how to do the count.good.rows column without explicitly using lag(...,1) + lag(...,2) + lag(...,3) . This would be a poor solution because I need the 3 to be a variable (I may end up wanting to look at previous 2 rows, or 4 rows).

Any ideas?

Here's my full dput :

df = structure(list(date = structure(c(16829, 16860, 16891, 16920, 16952, 16982, 17011, 17044, 17074, 17105, 17135, 17165, 16829, 16860, 16891, 16920, 16952, 16982, 17011, 17044, 17074, 17105, 17135, 17165), class = "Date"),
    sec_id = c(3277L, 3277L, 3277L, 3277L, 3277L, 3277L, 3277L, 3277L, 3277L, 3277L, 3277L, 3277L, 3426L, 3426L, 3426L, 3426L, 3426L, 3426L, 3426L, 3426L, 3426L, 3426L, 3426L, 3426L),
    liquidity = c(4.014428, 3.779665, 4.833813, 5.244417, 7.150838, 7.639399, 9.142245, 11.070555, 11.934113, 12.192237, 10.165183, 8.414033, 6.494181, 8.216213, 10.081115, 10.119685, 8.659732, 6.790178, 7.234159, 8.529101, 9.015898, 8.307979, 8.231237, 8.711095),
    good = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)),
    class = "data.frame", .Names = c("date", "sec_id", "liquidity", "good"), 
    row.names = c(NA, -24L))

You could define a lagcounter function which substracts to the cumulative sum of good the cumulative sum n rows back:

lagcounter = function(x,n) {y = cumsum(x); lag(y-lag(y,n,default=0),default=0)}

And then using a dplyr syntax, use this newly defined function in a mutate statement, after having grouped by sec_id :

library(dplyr)
df %>% group_by(sec_id) %>% mutate(count.good.rows = lagcounter(good,3)) 

         date sec_id liquidity  good count.good.rows
1  2016-01-29   3277  4.014428 FALSE               0
2  2016-02-29   3277  3.779665 FALSE               0
3  2016-03-31   3277  4.833813 FALSE               0
4  2016-04-29   3277  5.244417 FALSE               0
5  2016-05-31   3277  7.150838 FALSE               0
6  2016-06-30   3277  7.639399 FALSE               0
7  2016-07-29   3277  9.142245 FALSE               0
8  2016-08-31   3277 11.070555  TRUE               0
9  2016-09-30   3277 11.934113  TRUE               1
10 2016-10-31   3277 12.192237  TRUE               2
11 2016-11-30   3277 10.165183  TRUE               3
12 2016-12-30   3277  8.414033 FALSE               3
13 2016-01-29   3426  6.494181 FALSE               0
14 2016-02-29   3426  8.216213 FALSE               0
15 2016-03-31   3426 10.081115  TRUE               0
16 2016-04-29   3426 10.119685  TRUE               1
17 2016-05-31   3426  8.659732 FALSE               2
18 2016-06-30   3426  6.790178 FALSE               2
19 2016-07-29   3426  7.234159 FALSE               1
20 2016-08-31   3426  8.529101 FALSE               0
21 2016-09-30   3426  9.015898 FALSE               0
22 2016-10-31   3426  8.307979 FALSE               0
23 2016-11-30   3426  8.231237 FALSE               0
24 2016-12-30   3426  8.711095 FALSE               0

Try zoo::rollapply

library(zoo)
df %>%
  group_by(sec_id) %>%
  mutate(count_good_rows = rollapply(good, 3, sum, align="right", partial=TRUE))

# A tibble: 13 x 5
# Groups: sec_id [2]
   # date       sec_id liquidity good  count_good_rows
   # <fctr>      <int>     <dbl> <lgl>           <int>
 # 1 2016-07-29   3277      9.14 F                   0
 # 2 2016-08-31   3277     11.1  T                   1
 # 3 2016-09-30   3277     11.9  T                   2
 # 4 2016-10-31   3277     12.2  T                   3
 # 5 2016-11-30   3277     10.2  T                   3
 # 6 2016-12-30   3277      8.41 F                   2
 # 7 2016-01-29   3426      6.49 F                   0
 # 8 2016-02-29   3426      8.22 F                   0
 # 9 2016-03-31   3426     10.1  T                   1
# 10 2016-04-29   3426     10.1  T                   2
# 11 2016-05-31   3426      8.66 F                   2
# 12 2016-06-30   3426      6.79 F                   1
# 13 2016-07-29   3426      7.23 F                   0

Edit If you're only interested in counting the previous three rows

df %>%
  group_by(sec_id) %>%
  mutate(count_good_rows = rollapply(dplyr::lag(good, 1), 3, function(i) sum(i, na.rm=TRUE), align="right", partial=TRUE))

# A tibble: 13 x 5
# Groups: sec_id [2]
   # date       sec_id liquidity good  count_good_rows
   # <fctr>      <int>     <dbl> <lgl>           <int>
 # 1 2016-07-29   3277      9.14 F                   0
 # 2 2016-08-31   3277     11.1  T                   0
 # 3 2016-09-30   3277     11.9  T                   1
 # 4 2016-10-31   3277     12.2  T                   2
 # 5 2016-11-30   3277     10.2  T                   3
 # 6 2016-12-30   3277      8.41 F                   3
 # 7 2016-01-29   3426      6.49 F                   0
 # 8 2016-02-29   3426      8.22 F                   0
 # 9 2016-03-31   3426     10.1  T                   0
# 10 2016-04-29   3426     10.1  T                   1
# 11 2016-05-31   3426      8.66 F                   2
# 12 2016-06-30   3426      6.79 F                   2
# 13 2016-07-29   3426      7.23 F                   1

Data

df <- read.table(text="date     sec_id  liquidity   good     
2016-07-29   3277  9.142245 FALSE 
2016-08-31   3277 11.070555  TRUE  
2016-09-30   3277 11.934113  TRUE  
2016-10-31   3277 12.192237  TRUE  
2016-11-30   3277 10.165183  TRUE  
2016-12-30   3277  8.414033 FALSE 
2016-01-29   3426  6.494181 FALSE 
2016-02-29   3426  8.216213 FALSE 
2016-03-31   3426 10.081115  TRUE  
2016-04-29   3426 10.119685  TRUE  
2016-05-31   3426  8.659732 FALSE 
2016-06-30   3426  6.790178 FALSE  
2016-07-29   3426  7.234159 FALSE  ", header=TRUE)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM