What is the most efficient way to add a column that is a binary indicator of a recurring number in a time series dataframe?
I have a dataframe that is similar to this example dataframe:
example <- data.frame(id = c("1","1","1", "1", "2", "2", "2"),
amount = c(2300, 1765, 2300, 1500, 35, 180, 180),
date = c("2010-11-01", "2010-11-02", "2010-11-03", "2010-11-04", "2010-11-01", "2010-11-02", "2010-11-03"))
I want to add a column that contains a 1 wherever the amount is a recurring amount, and a 0 otherwise. An amount only counts as recurring if it repeats within the same id. So it would look like this:
desiredResult <- data.frame(id = c("1","1","1", "1", "2", "2", "2"),
amount = c(2300, 1765, 2300, 1500, 2300, 180, 180),
date = c("2010-11-01", "2010-11-02", "2010-11-03", "2010-11-04", "2010-11-01", "2010-11-02", "2010-11-03"),
probableRecurringAmount = c(1,0,1,0,0,1,1))
The dataset is very large and I am having a hard time coming up with an efficient solution. I was considering adding keys to a column based on combinations of these other columns, but I only want a binary flag.
You can do it like this:
library(dplyr)
example %>%
group_by(id, amount) %>%
mutate(probableRecurringAmount = ifelse(n() > 1, 1, 0))
# A tibble: 7 x 4
# Groups: id, amount [5]
# id amount date probableRecurringAmount
#<fct> <dbl> <fct> <dbl>
#1 1 2300 2010-11-01 1
#2 1 1765 2010-11-02 0
#3 1 2300 2010-11-03 1
#4 1 1500 2010-11-04 0
#5 2 35 2010-11-01 0
#6 2 180 2010-11-02 1
#7 2 180 2010-11-03 1
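Since the question mentions a very large dataset, the same grouped count can also be written with data.table, which adds the column by reference instead of copying the data. This is only a sketch of an alternative, not the method used in the answer above, and it assumes the data.table package is installed.

```r
library(data.table)

example <- data.frame(id = c("1", "1", "1", "1", "2", "2", "2"),
                      amount = c(2300, 1765, 2300, 1500, 35, 180, 180),
                      date = c("2010-11-01", "2010-11-02", "2010-11-03", "2010-11-04",
                               "2010-11-01", "2010-11-02", "2010-11-03"))

# Convert to a data.table (this makes one copy; setDT() would convert in place)
dt <- as.data.table(example)

# .N is the number of rows in the current (id, amount) group;
# := adds the flag column by reference, keeping the original row order
dt[, probableRecurringAmount := as.integer(.N > 1L), by = .(id, amount)]

dt$probableRecurringAmount
```

The grouping columns are the same as in the dplyr version, so the flag is 1 only when an amount repeats within the same id.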
You can use duplicated() to find duplicated rows, then join with the original data to flag both the original and the duplicate.
library(tidyverse)
example <- data.frame(id = c("1","1","1", "1", "2", "2", "2"),
amount = c(2300, 1765, 2300, 1500, 35, 180, 180),
date = c("2010-11-01", "2010-11-02", "2010-11-03", "2010-11-04", "2010-11-01", "2010-11-02", "2010-11-03"))
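# Find (id, amount) pairs that occur more than once.
# distinct() keeps one row per pair, so a value repeated three or more
# times does not multiply rows in the join below.
dups = example %>%
  select(id, amount) %>%
  mutate(recurring=as.numeric(duplicated(.))) %>%
  filter(recurring==1) %>%
  distinct()
# Flag both the original and duplicated rows as recurring
example %>% left_join(dups, by = c("id", "amount")) %>%
  replace_na(list(recurring=0))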
#> id amount date recurring
#> 1 1 2300 2010-11-01 1
#> 2 1 1765 2010-11-02 0
#> 3 1 2300 2010-11-03 1
#> 4 1 1500 2010-11-04 0
#> 5 2 35 2010-11-01 0
#> 6 2 180 2010-11-02 1
#> 7 2 180 2010-11-03 1
Created on 2020-01-14 by the reprex package (v0.3.0)
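One thing to watch with a join-based approach is that the lookup table should hold exactly one row per (id, amount) pair, otherwise the join can return more rows than the input. A possible variant, sketched here with made-up data (the payments data frame is hypothetical, not from the question), builds the lookup with count() so that each pair appears once regardless of how often it repeats.

```r
library(dplyr)

# Hypothetical data: 500 appears three times within id "1" and once in id "2"
payments <- data.frame(
  id     = c("1", "1", "1", "1", "2"),
  amount = c(500, 500, 500, 20, 500)
)

# One row per (id, amount) pair; n holds the number of occurrences
lookup <- payments %>%
  count(id, amount) %>%
  mutate(recurring = as.integer(n > 1)) %>%
  select(-n)

# Every input row matches exactly one lookup row, so the row count is unchanged
flagged <- payments %>%
  left_join(lookup, by = c("id", "amount"))

flagged
```

Because every pair is present in the lookup, no replace_na() step is needed afterwards.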
We can use duplicated from base R:
example$recurring <- +(duplicated(example[c('id', 'amount')])|
duplicated(example[c('id', 'amount')], fromLast = TRUE))
example$recurring
#[1] 1 0 1 0 0 1 1
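The two duplicated() calls are both needed because duplicated() alone marks only the second and later occurrences; scanning again with fromLast = TRUE marks the earlier ones. As a rough sketch, the same idea can be wrapped in a small helper. The name flag_recurring and its arguments are made up for illustration and are not part of base R.

```r
# Hypothetical helper: returns 1 for rows whose key columns occur more than once
flag_recurring <- function(df, cols) {
  key <- df[cols]
  # Forward pass flags later occurrences, backward pass flags earlier ones
  as.integer(duplicated(key) | duplicated(key, fromLast = TRUE))
}

example <- data.frame(id = c("1", "1", "1", "1", "2", "2", "2"),
                      amount = c(2300, 1765, 2300, 1500, 35, 180, 180),
                      date = c("2010-11-01", "2010-11-02", "2010-11-03", "2010-11-04",
                               "2010-11-01", "2010-11-02", "2010-11-03"))

example$recurring <- flag_recurring(example, c("id", "amount"))
example$recurring
#[1] 1 0 1 0 0 1 1
```

Passing the key columns as an argument makes it easy to change what counts as "the same" amount, for example by adding another grouping column.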