Loop through rows and count the number of rows that match multiple criteria in R
I have a dataset that looks like this:
        city period_day       date
1  barcelona    morning 2017-01-15
2  sao_paulo  afternoon 2016-12-07
3  sao_paulo    morning 2016-11-16
4  barcelona    morning 2016-11-06
5  barcelona  afternoon 2016-12-31
6  sao_paulo  afternoon 2016-11-30
7  barcelona    morning 2016-10-15
8  barcelona  afternoon 2016-11-30
9  sao_paulo  afternoon 2016-12-24
10 sao_paulo  afternoon 2017-02-02
For every row, I want to count how many other rows with the same city and period_day have an older date than that row. In this case, I want this result:
        city period_day       date row_count
1  barcelona    morning 2017-01-15         2
2  sao_paulo  afternoon 2016-12-07         1
3  sao_paulo    morning 2016-11-16         0
4  barcelona    morning 2016-11-06         1
5  barcelona  afternoon 2016-12-31         1
6  sao_paulo  afternoon 2016-11-30         0
7  barcelona    morning 2016-10-15         0
8  barcelona  afternoon 2016-11-30         0
9  sao_paulo  afternoon 2016-12-24         2
10 sao_paulo  afternoon 2017-02-02         3
When row_count equals 0, the row has the oldest date in its city and period_day group.
I came up with a solution, but it takes too long on larger data. Here is the code:
library(dplyr)

get_count_function <- function(df) {
  idx <- 1:nrow(df)
  count <- sapply(idx, function(x) {
    # Look up the city, period and date of row x
    name_city <-
      df %>% select(city) %>% filter(row_number() == x) %>% pull()
    name_period <-
      df %>% select(period_day) %>% filter(row_number() == x) %>% pull()
    date_row <- df %>%
      select(date) %>%
      filter(row_number() == x) %>%
      pull()
    # Dates of all other rows in the same city and period
    date_any_row <- df %>%
      filter(dplyr::row_number() != x,
             city == name_city,
             period_day == name_period) %>%
      select(date) %>%
      pull()
    # Count how many of those dates are older than the date of row x
    how_many <- sum(date_row > date_any_row)
    return(how_many)
  })
  return(count)
}
How could I make this function more efficient?
Try this one:
library(tidyverse)
dat %>%
  group_by(city, period_day) %>%
  mutate(row_count = rank(date, ties.method = "min") - 1) %>%
  ungroup()
Within each group, rank() returns the position each date would take if the group were sorted. Subtracting 1 from that rank gives the number of values that precede the current value in its group. For example, the minimum value in a group has rank 1, so nothing precedes it (1 - 1 = 0); a value with rank 2 has exactly one older date before it, and so on.
Note that order() is not a substitute here: it returns the permutation of indices that sorts the values, which happens to match the ranks for this sample data but not in general. With ties.method = "min", equal dates share the lowest rank, so only strictly older dates are counted.
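A minimal sketch of why rank() and not order() gives the count of older dates. The vector d is made up for illustration (it is not part of the data above) and deliberately contains a tie; the brute-force count is only there as a reference.

```r
# Hypothetical dates, chosen so that order() and rank() disagree.
d <- as.Date(c("2016-12-07", "2017-02-02", "2016-11-30", "2016-12-07"))

# order() gives the indices that would sort d;
# rank() gives the sorted position of each element.
ord <- order(d)
rnk <- rank(d, ties.method = "min")

# Number of strictly older dates for each element.
older <- rnk - 1

# Reference: count strictly older dates directly, one element at a time.
brute <- vapply(seq_along(d), function(i) sum(d < d[i]), integer(1))

print(data.frame(d, order_minus_1 = ord - 1, older, brute))
```

In this sketch older agrees with the brute-force count on every element, while order(d) - 1 does not.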
Data:
dat <- read.table(
  text = "city period_day date
  barcelona morning 2017-01-15
  sao_paulo afternoon 2016-12-07
  sao_paulo morning 2016-11-16
  barcelona morning 2016-11-06
  barcelona afternoon 2016-12-31
  sao_paulo afternoon 2016-11-30
  barcelona morning 2016-10-15
  barcelona afternoon 2016-11-30
  sao_paulo afternoon 2016-12-24
  sao_paulo afternoon 2017-02-02",
  header = TRUE,
  colClasses = c("character", "character", "Date")
)
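As a sanity check, the grouped count can also be sketched in base R and compared against a direct, row-by-row count. This is only an illustrative variant, assuming the same sample data; the use of ave() and the names brute and row_count_base are choices made here, not part of the answer above.

```r
dat <- read.table(
  text = "city period_day date
  barcelona morning 2017-01-15
  sao_paulo afternoon 2016-12-07
  sao_paulo morning 2016-11-16
  barcelona morning 2016-11-06
  barcelona afternoon 2016-12-31
  sao_paulo afternoon 2016-11-30
  barcelona morning 2016-10-15
  barcelona afternoon 2016-11-30
  sao_paulo afternoon 2016-12-24
  sao_paulo afternoon 2017-02-02",
  header = TRUE,
  colClasses = c("character", "character", "Date")
)

# ave() applies FUN within each city/period_day group and returns the
# results in the original row order. Dates are passed as day numbers.
dat$row_count_base <- ave(
  as.numeric(dat$date), dat$city, dat$period_day,
  FUN = function(x) rank(x, ties.method = "min") - 1
)

# Reference: for each row, count same-group rows with a strictly older date.
brute <- vapply(seq_len(nrow(dat)), function(i) {
  sum(dat$city == dat$city[i] &
        dat$period_day == dat$period_day[i] &
        dat$date < dat$date[i])
}, integer(1))

print(cbind(dat, brute))
```

The brute-force loop is quadratic in the number of rows, so it is only suitable for checking results on small inputs.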
This should work if you are willing to use the data.table package:
library(data.table)
dat <- read.table(header=T, row.names=1, text="
city period_day date
1 barcelona morning 2017-01-15
2 sao_paulo afternoon 2016-12-07
3 sao_paulo morning 2016-11-16
4 barcelona morning 2016-11-06
5 barcelona afternoon 2016-12-31
6 sao_paulo afternoon 2016-11-30
7 barcelona morning 2016-10-15
8 barcelona afternoon 2016-11-30
9 sao_paulo afternoon 2016-12-24
10 sao_paulo afternoon 2017-02-02
")
dat <- as.data.table(dat)
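# rank() rather than order(): order() returns the sorting permutation,
# which only coincides with the ranks for this particular sample.
dat[, row_count := rank(as.Date(date), ties.method = "min") - 1, by = .(city, period_day)]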
# Check
dat
## city period_day date row_count
## 1: barcelona morning 2017-01-15 2
## 2: sao_paulo afternoon 2016-12-07 1
## 3: sao_paulo morning 2016-11-16 0
## 4: barcelona morning 2016-11-06 1
## 5: barcelona afternoon 2016-12-31 1
## 6: sao_paulo afternoon 2016-11-30 0
## 7: barcelona morning 2016-10-15 0
## 8: barcelona afternoon 2016-11-30 0
## 9: sao_paulo afternoon 2016-12-24 2
## 10: sao_paulo afternoon 2017-02-02 3