[英]Using R for extracting rows from a large dataset which meet multiple conditions for each vendor_id
Using R Programming.使用 R 编程。 I have dataset with Vendor_id, Bank_account_no and Date with over 3 million.
我有超过 300 万个 Vendor_id、Bank_account_no 和 Date 的数据集。 I want to get the rows for each vendor_id where the Bank_account_no changes, for example from X to X to X (at least three time, could be more than three) to Y (just once) to X again within a three months period.
我想获取 Bank_account_no 更改的每个 vendor_id 的行,例如在三个月内从 X 到 X 到 X (至少三次,可能超过三次)到 Y (仅一次)到 X 。 The dataset is such that the changes are all random so the window is not fixed with number of rows for each vendor_id.
数据集的变化都是随机的,因此 window 不固定每个 vendor_id 的行数。 I using the rle function to get length for different Bank_account_no.
我使用 rle function来获取不同 Bank_account_no 的长度。 Not sure how can I create a logic in R for these many rows considering I want to run this logic for each vendor_id.
考虑到我想为每个 vendor_id 运行此逻辑,不知道如何在 R 中为这么多行创建逻辑。 May be data.table can help in this.
可能是 data.table 可以提供帮助。 The input is as follows:
输入如下:
Vendor_ID Bank_account_no Date
<!-- -->
dddd X 24-12-2018
dddd X 24-12-2018
dddd X 26-12-2018
dddd Y 27-12-2018
dddd X 28-12-2018
dddd X 29-12-2018
dddd X 29-12-2018
dddd X 31-12-2018
dddd X 24-01-2019
dddd Z 25-01-2019
dddd X 28-01-2019
dddd G 28-01-2019
dddd G 28-01-2019
eeee A 30-01-2019
eeee A 31-01-2019
eeee A 31-01-2019
eeee B 31-01-2019
eeee A 31-01-2019
The output should be: output 应该是:
Vendor_ID Bank_account_no Date Case
<!-- -->
dddd X 24-12-2018 Case1
dddd X 24-12-2018 Case1
dddd X 26-12-2018 Case1
dddd Y 27-12-2018 Case1
dddd X 28-12-2018 Case1
dddd X 29-12-2018 Case2
dddd X 29-12-2018 Case2
dddd X 31-12-2018 Case2
dddd X 24-01-2019 Case2
dddd Z 25-01-2019 Case2
dddd X 28-01-2019 Case2
eeee A 30-01-2019 Case3
eeee A 31-01-2019 Case3
eeee A 31-01-2019 Case3
eeee B 31-01-2019 Case3
eeee A 31-01-2019 Case3
There might be edge cases not covered but this is an option to start you off:可能有一些边缘情况没有被涵盖,但这是一个让你开始的选项:
#get run length encoding for ID
DT[, r := rleid(ID)]
#identify locations of desired changes in ID
DT[, w := ID!=shift(ID, -1L) & ID==shift(ID, -2L)]
#set to TRUE at the start of a sequence
DT[, sw := replace(shift(w, 3L, fill=FALSE), 1L, TRUE), Vendor_ID]
#filter out those not part of any sequence
ans <- DT[-DT[, {
ix <- which(w)
if (length(ix) > 0L) .I[!r %in% r[sapply(ix, `+`, 0L:2L)]]
}, Vendor_ID]$V1]
#create the desired Case column
ans[, Case := paste0("Case", cumsum(sw))]
output: output:
Vendor_ID ID Date r w sw Case
1: dddd X 24-12-2018 1 FALSE TRUE Case1
2: dddd X 24-12-2018 1 FALSE FALSE Case1
3: dddd X 26-12-2018 1 TRUE FALSE Case1
4: dddd Y 27-12-2018 2 FALSE FALSE Case1
5: dddd X 28-12-2018 3 FALSE FALSE Case1
6: dddd X 29-12-2018 3 FALSE TRUE Case2
7: dddd X 29-12-2018 3 FALSE FALSE Case2
8: dddd X 31-12-2018 3 FALSE FALSE Case2
9: dddd X 24-01-2019 3 TRUE FALSE Case2
10: dddd Z 25-01-2019 4 FALSE FALSE Case2
11: dddd X 28-01-2019 5 FALSE FALSE Case2
12: eeee A 30-01-2019 7 FALSE TRUE Case3
13: eeee A 31-01-2019 7 FALSE FALSE Case3
14: eeee A 31-01-2019 7 TRUE FALSE Case3
15: eeee B 31-01-2019 8 NA FALSE Case3
16: eeee A 31-01-2019 9 NA FALSE Case3
data:数据:
library(data.table)
DT <- fread("Vendor_ID ID Date
dddd X 24-12-2018
dddd X 24-12-2018
dddd X 26-12-2018
dddd Y 27-12-2018
dddd X 28-12-2018
dddd X 29-12-2018
dddd X 29-12-2018
dddd X 31-12-2018
dddd X 24-01-2019
dddd Z 25-01-2019
dddd X 28-01-2019
dddd G 28-01-2019
dddd G 28-01-2019
eeee A 30-01-2019
eeee A 31-01-2019
eeee A 31-01-2019
eeee B 31-01-2019
eeee A 31-01-2019")
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.