What is the most efficient way to add a column that is a binary indicator of a recurring number in a time series dataframe?

I have a dataframe that is similar to this example dataframe:

example <- data.frame(id = c("1","1","1", "1", "2", "2", "2"),
                      amount = c(2300, 1765, 2300, 1500, 35, 180, 180),
                      date = c("2010-11-01", "2010-11-02", "2010-11-03", "2010-11-04", "2010-11-01", "2010-11-02", "2010-11-03"))

I want to add a column that holds a 1 when an amount is a recurring amount and a 0 otherwise. An amount counts as recurring only if it repeats within the same id. So it would look like this:

desiredResult <- data.frame(id = c("1","1","1", "1", "2", "2", "2"),
                      amount = c(2300, 1765, 2300, 1500, 2300, 180, 180),
                      date = c("2010-11-01", "2010-11-02", "2010-11-03", "2010-11-04", "2010-11-01", "2010-11-02", "2010-11-03"),
                      probableRecurringAmount = c(1,0,1,0,0,1,1)) 

The dataset is very large and I am having a hard time coming up with an efficient solution. I was considering adding keys to a column based on combinations of these other columns, but I only want a binary flag.

You can do it like this:

library(dplyr)
# Group by id and amount, then flag every row whose group has more than one row
example %>%
  group_by(id, amount) %>%
  mutate(probableRecurringAmount = ifelse(n() > 1, 1, 0))

# A tibble: 7 x 4
# Groups:   id, amount [5]
# id      amount date       probableRecurringAmount
#<fct>  <dbl> <fct>                        <dbl>
#1 1       2300 2010-11-01                       1
#2 1       1765 2010-11-02                       0
#3 1       2300 2010-11-03                       1
#4 1       1500 2010-11-04                       0
#5 2         35 2010-11-01                       0
#6 2        180 2010-11-02                       1
#7 2        180 2010-11-03                       1
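
Note that the result above is still grouped by id and amount (see the `# Groups:` line), which can change how later `summarise()` or `mutate()` calls behave. A minimal sketch, assuming you want an ungrouped data frame with an integer flag; the name `flagged` is only illustrative:

```r
library(dplyr)

example <- data.frame(id = c("1", "1", "1", "1", "2", "2", "2"),
                      amount = c(2300, 1765, 2300, 1500, 35, 180, 180),
                      date = c("2010-11-01", "2010-11-02", "2010-11-03", "2010-11-04",
                               "2010-11-01", "2010-11-02", "2010-11-03"))

flagged <- example %>%
  group_by(id, amount) %>%
  # n() is the group size; the comparison is coerced to 1L / 0L
  mutate(probableRecurringAmount = as.integer(n() > 1)) %>%
  # drop the grouping so later steps operate on the whole table
  ungroup()

flagged$probableRecurringAmount
# [1] 1 0 1 0 0 1 1
```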

You can use duplicated to find duplicated rows, then join with the original data to flag both the original and the duplicate.

library(tidyverse)
example <- data.frame(id = c("1","1","1", "1", "2", "2", "2"),
                      amount = c(2300, 1765, 2300, 1500, 35, 180, 180),
                      date = c("2010-11-01", "2010-11-02", "2010-11-03", "2010-11-04", "2010-11-01", "2010-11-02", "2010-11-03"))

# Find duplicated (id, amount) pairs; distinct() keeps one lookup row per pair
# so the join below cannot multiply rows when a pair occurs three or more times
dups = example %>% 
  select(id, amount) %>% 
  mutate(recurring=as.numeric(duplicated(.))) %>% 
  filter(recurring==1) %>% 
  distinct()

# Flag both the original and duplicated rows as recurring
example %>% left_join(dups, by = c("id", "amount")) %>% 
  replace_na(list(recurring=0))
#>   id amount       date recurring
#> 1  1   2300 2010-11-01         1
#> 2  1   1765 2010-11-02         0
#> 3  1   2300 2010-11-03         1
#> 4  1   1500 2010-11-04         0
#> 5  2     35 2010-11-01         0
#> 6  2    180 2010-11-02         1
#> 7  2    180 2010-11-03         1

Created on 2020-01-14 by the reprex package (v0.3.0)
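
One edge case with the join approach is worth checking: when an amount occurs three or more times within an id, `duplicated()` marks more than one row, so the lookup table should hold each (id, amount) pair only once before joining. A minimal sketch, assuming hypothetical data (`triple`) in which the amount 50 appears three times:

```r
library(dplyr)

# Hypothetical data: amount 50 occurs three times for id "1"
triple <- data.frame(id = c("1", "1", "1", "1"),
                     amount = c(50, 50, 50, 20))

dups <- triple %>%
  select(id, amount) %>%
  filter(duplicated(.)) %>%  # second and later occurrences only
  distinct() %>%             # one lookup row per (id, amount) pair
  mutate(recurring = 1)

flagged <- triple %>%
  left_join(dups, by = c("id", "amount")) %>%
  mutate(recurring = coalesce(recurring, 0))  # unmatched rows become 0

nrow(flagged)      # same number of rows as `triple`
flagged$recurring
```

Without the `distinct()` step, each of the three matching rows would join to two lookup rows, and the result would have more rows than the input.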

We can use duplicated from base R:

example$recurring <-  +(duplicated(example[c('id', 'amount')])|
         duplicated(example[c('id', 'amount')], fromLast = TRUE))
example$recurring
#[1] 1 0 1 0 0 1 1
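
The two `duplicated()` calls are both needed because each scan direction leaves one end of a repeated run unmarked. A small sketch of that behaviour, wrapped in a helper whose name (`flag_recurring`) is hypothetical and not part of the original answer:

```r
# Hypothetical helper: 1L for every row whose key columns occur more than once
flag_recurring <- function(df, cols) {
  key <- df[cols]
  # duplicated() marks only the second and later occurrences;
  # fromLast = TRUE marks every occurrence except the last, so OR-ing covers all
  as.integer(duplicated(key) | duplicated(key, fromLast = TRUE))
}

x <- c(2300, 1765, 2300)
duplicated(x)                   # FALSE FALSE  TRUE
duplicated(x, fromLast = TRUE)  #  TRUE FALSE FALSE

example <- data.frame(id = c("1", "1", "1", "1", "2", "2", "2"),
                      amount = c(2300, 1765, 2300, 1500, 35, 180, 180),
                      date = c("2010-11-01", "2010-11-02", "2010-11-03", "2010-11-04",
                               "2010-11-01", "2010-11-02", "2010-11-03"))

example$recurring <- flag_recurring(example, c("id", "amount"))
example$recurring
# [1] 1 0 1 0 0 1 1
```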
