How to filter large data-sets by two attributes and split into subsets? R / Grep
I have hit the limits of the grep() function, or perhaps there are more efficient ways of doing this.
Start with a sample data frame:
Date <- c( "31-DEC-2014","31-DEC-2014","31-DEC-2014","30-DEC-2014",
"30-DEC-2014","30-DEC-2014", "29-DEC-2014","29-DEC-2014","29-DEC-2014" )
ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088","LU0168343191",
"TW0002418001", "GB00B3FFY088","LU0168343191", "TW0002418001", "GB00B3FFY088")
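price <- 1:9
# data.frame() keeps price numeric; as.data.frame(cbind(...)) would coerce every column to character
df <- data.frame(Date, ISIN, price, stringsAsFactors = FALSE)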
The desired result is a list() containing subsets of the main data set, each looking like the one below (x3, one for each of the 3 individual identifiers; Result_I shows one of them).
The idea is that the data should first be filtered by ISIN and then by Date. This two-step process should keep my data intact.
Result_d <- c("31-DEC-2014", "30-DEC-2014","29-DEC-2014")
Result_I <- c("LU0168343191","LU0168343191","LU0168343191")
Result_P <- c(1,4,7)
Result_df <- cbind(Result_d, Result_I, Result_P)
Please keep in mind that the above is for demo purposes. The real data set has 5M rows and 50 columns covering 450+ different dates (as in Result_d), so I am looking for something that works irrespective of nrow or ncol.
What I have so far:
I take all unique dates and store them:
Unique_Dates <- unique(df$Date)
The same for the identifiers:
Unique_ID <- unique(df$ISIN)
Now the grepping issue:
If I wanted all rows containing Unique_Dates, I would do something like:
pattern <- paste(Unique_Dates, collapse = "|")
result <- as.matrix(df[grep(pattern, df$Date),])
and this retrieves basically the entire data set. I am wondering if anyone knows an efficient way of doing this.
Thanks in advance.
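For reference, the two-step filter described in the question can be sketched in base R without grep(), assuming exact matches on ISIN and Date are what is wanted rather than regular-expression matches. Here split() builds the list of per-ISIN subsets and %in% does the date filtering. The name wanted_dates is a hypothetical selection introduced only for illustration.

```r
# Sample data from the question; data.frame() keeps price numeric.
Date <- c("31-DEC-2014", "31-DEC-2014", "31-DEC-2014", "30-DEC-2014",
          "30-DEC-2014", "30-DEC-2014", "29-DEC-2014", "29-DEC-2014", "29-DEC-2014")
ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088", "LU0168343191",
          "TW0002418001", "GB00B3FFY088", "LU0168343191", "TW0002418001", "GB00B3FFY088")
df <- data.frame(Date, ISIN, price = 1:9, stringsAsFactors = FALSE)

# Step 1: one subset per ISIN, returned as a named list.
by_isin <- split(df, df$ISIN)

# Step 2: within one subset, keep only the wanted dates by exact match.
# wanted_dates is a hypothetical selection, not part of the question.
wanted_dates <- c("31-DEC-2014", "29-DEC-2014")
lu <- by_isin[["LU0168343191"]]
lu_subset <- lu[lu$Date %in% wanted_dates, ]
```

Because split() and %in% compare whole values, no pattern string has to be built, and the same two lines apply whatever the number of rows or columns.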
Using dplyr:
library(dplyr)
Date <- c( "31-Dec-2014","31-Dec-2014","31-Dec-2014","30-Dec-2014",
"30-Dec-2014","30-Dec-2014", "29-Dec-2014","29-Dec-2014","29-Dec-2014" )
ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088","LU0168343191",
"TW0002418001", "GB00B3FFY088","LU0168343191", "TW0002418001", "GB00B3FFY088")
price <-c(seq(1:9))
DF <- data.frame(Date, ISIN, price,stringsAsFactors=FALSE)
DF$Date=as.Date(DF$Date,"%d-%b-%Y")
#Examine data ranges and frequencies
#date range
range(DF$Date)
#date frequency count
table(DF$Date)
#ISIN frequency count
table(DF$ISIN)
#select ISINs for filtering, with user defined choice of filters
# numISIN = 2
# subISIN = head(names(sort(table(DF$ISIN))),numISIN)
subISIN = names(sort(table(DF$ISIN)))[2]
subDF=DF %>%
dplyr::group_by(ISIN) %>%
dplyr::arrange(ISIN,Date) %>%
dplyr::filter(ISIN %in% subISIN) %>%
as.data.frame()
#> subDF
# Date ISIN price
#1 2014-12-29 LU0168343191 7
#2 2014-12-30 LU0168343191 4
#3 2014-12-31 LU0168343191 1
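The filter above returns the subset for one ISIN. To get the list of subsets the question asks for, one possible extension (a sketch, not part of the original answer) is to sort once with arrange() and then hand the result to base R split(). It assumes a locale in which "%b" parses "Dec".

```r
library(dplyr)

Date <- c("31-Dec-2014", "31-Dec-2014", "31-Dec-2014", "30-Dec-2014",
          "30-Dec-2014", "30-Dec-2014", "29-Dec-2014", "29-Dec-2014", "29-Dec-2014")
ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088", "LU0168343191",
          "TW0002418001", "GB00B3FFY088", "LU0168343191", "TW0002418001", "GB00B3FFY088")
DF <- data.frame(Date, ISIN, price = 1:9, stringsAsFactors = FALSE)

# Assumption: the session locale understands English month abbreviations.
DF$Date <- as.Date(DF$Date, "%d-%b-%Y")

# Sort by ISIN and Date once, then cut the sorted frame into one piece per ISIN.
sorted <- DF %>% arrange(ISIN, Date)
subsets <- split(sorted, sorted$ISIN)

# Each element is a data.frame holding every row for that identifier.
subsets[["LU0168343191"]]
```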
We convert the 'data.frame' to a 'data.table' (setDT(df)), specify 'i' based on the index returned by grep and, grouped by 'Date', subset the data.table (.SD) based on that 'i' index.
library(data.table)
setDT(df)[grep("LU", ISIN), .SD, by = Date]
# Date ISIN price
#1: 31-DEC-2014 LU0168343191 1
#2: 30-DEC-2014 LU0168343191 4
#3: 29-DEC-2014 LU0168343191 7
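A possible variation on the above, offered as a sketch rather than as the answer's own method: when the full identifier is known, an exact comparison in 'i' avoids the regular expression, and data.table's split() method with the by argument returns one data.table per ISIN as a named list.

```r
library(data.table)

dt <- data.table(
  Date = c("31-DEC-2014", "31-DEC-2014", "31-DEC-2014", "30-DEC-2014",
           "30-DEC-2014", "30-DEC-2014", "29-DEC-2014", "29-DEC-2014", "29-DEC-2014"),
  ISIN = c("LU0168343191", "TW0002418001", "GB00B3FFY088", "LU0168343191",
           "TW0002418001", "GB00B3FFY088", "LU0168343191", "TW0002418001", "GB00B3FFY088"),
  price = 1:9
)

# Exact match on the full identifier instead of a grep pattern.
lu <- dt[ISIN == "LU0168343191"]

# One data.table per ISIN, collected in a named list.
subsets <- split(dt, by = "ISIN")
```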