How to filter a large data set by two attributes and split it into subsets? R/Grep

Question

I have hit the limits of the grep() function, or perhaps there are more efficient ways of doing this.

Start with a sample data frame:

Date <- c( "31-DEC-2014","31-DEC-2014","31-DEC-2014","30-DEC-2014",
           "30-DEC-2014","30-DEC-2014", "29-DEC-2014","29-DEC-2014","29-DEC-2014" )

ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088","LU0168343191",
          "TW0002418001", "GB00B3FFY088","LU0168343191", "TW0002418001", "GB00B3FFY088")


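price <- 1:9

# data.frame() keeps price numeric; as.data.frame(cbind(...)) would coerce every column to character
df <- data.frame(Date, ISIN, price, stringsAsFactors = FALSE)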

The desired result is a list() containing subsets of the main data set, each looking like the one below (one per identifier, so 3 in total for the 3 individual ISINs).

The idea is that the data should first be filtered by ISIN and then by Date. This two-step process should keep my data intact.

Result_d <- c("31-DEC-2014", "30-DEC-2014","29-DEC-2014")
Result_I <- c("LU0168343191","LU0168343191","LU0168343191")
Result_P <- c(1,4,7)

Result_df <- cbind(Result_d, Result_I, Result_P)

Please keep in mind that the above is for demo purposes. The real data set has 5M rows and 50 columns spanning 450+ different dates (as in Result_d), so I am looking for something that works irrespective of nrow or ncol.

What I have so far:

I take all unique dates and store them:

Unique_Dates <- unique(df$Date)

The same for the identifiers:

Unique_ID <- unique(df$ISIN)

Now the grepping issue:

If I wanted all rows containing Unique_Dates I would do something like:

pattern <- paste(Unique_Dates, collapse = "|")

result <- as.matrix(df[grep(pattern, df$Date),])

This retrieves basically the entire data set. I am wondering if anyone knows an efficient way of doing this.

Thanks in advance.

Answer 1

Using dplyr:

library(dplyr)

Date <- c( "31-Dec-2014","31-Dec-2014","31-Dec-2014","30-Dec-2014",
           "30-Dec-2014","30-Dec-2014", "29-Dec-2014","29-Dec-2014","29-Dec-2014" )

ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088","LU0168343191",
          "TW0002418001", "GB00B3FFY088","LU0168343191", "TW0002418001", "GB00B3FFY088")


price <- 1:9

DF <- data.frame(Date, ISIN, price, stringsAsFactors = FALSE)
DF$Date <- as.Date(DF$Date, "%d-%b-%Y")



#Examine data ranges and frequencies

#date range
range(DF$Date)

#date frequency count
table(DF$Date)

#ISIN frequency count
table(DF$ISIN)


#select ISINs for filtering, with user defined choice of filters

# numISIN = 2
# subISIN = head(names(sort(table(DF$ISIN))),numISIN)


subISIN = names(sort(table(DF$ISIN)))[2]


subDF <- DF %>%
  dplyr::group_by(ISIN) %>%
  dplyr::arrange(ISIN, Date) %>%
  dplyr::filter(ISIN %in% subISIN) %>%
  as.data.frame()

#> subDF
#        Date         ISIN price
#1 2014-12-29 LU0168343191     7
#2 2014-12-30 LU0168343191     4
#3 2014-12-31 LU0168343191     1
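
The code above keeps a single ISIN. The question asks for a list() with one subset per ISIN, each ordered by date; a minimal sketch of that step follows. It uses base R's split() and order(), which are my addition and not part of the original answer, and it assumes an English locale so that "%b" parses "Dec". The name subsets is hypothetical.

```r
# Sample data as defined in this answer
Date <- c("31-Dec-2014", "31-Dec-2014", "31-Dec-2014", "30-Dec-2014",
          "30-Dec-2014", "30-Dec-2014", "29-Dec-2014", "29-Dec-2014", "29-Dec-2014")
ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088", "LU0168343191",
          "TW0002418001", "GB00B3FFY088", "LU0168343191", "TW0002418001", "GB00B3FFY088")
DF <- data.frame(Date, ISIN, price = 1:9, stringsAsFactors = FALSE)
DF$Date <- as.Date(DF$Date, "%d-%b-%Y")  # assumes an English locale for "%b"

# Sort once by ISIN, then by Date within each ISIN
sorted <- DF[order(DF$ISIN, DF$Date), ]

# split() returns a named list with one data frame per ISIN
subsets <- split(sorted, sorted$ISIN)

# Look up a single identifier by name
subsets[["LU0168343191"]]
```

Because the rows are sorted before splitting, every element of the list is already in date order, and no regular expression is involved at any point.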

Answer 2

We convert the 'data.frame' to a 'data.table' (setDT(df)), select rows in 'i' using the index returned by grep, and, grouped by 'Date', return the Subset of Data.table (.SD) for those rows.

library(data.table)
setDT(df)[grep("LU", ISIN), .SD, by = Date]
#           Date         ISIN price
#1: 31-DEC-2014 LU0168343191     1
#2: 30-DEC-2014 LU0168343191     4
#3: 29-DEC-2014 LU0168343191     7
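
grep("LU", ISIN) matches any ISIN that contains "LU" anywhere. On the real data, an exact match may be safer. The sketch below shows two variants that are not part of the original answer: exact matching with %in%, and data.table's split() method with a by argument (assuming a data.table version that provides it) to get one subset per ISIN as a list. The names dt, wanted, one and subsets are hypothetical.

```r
library(data.table)

# Sample data from the question
Date <- c("31-DEC-2014", "31-DEC-2014", "31-DEC-2014", "30-DEC-2014",
          "30-DEC-2014", "30-DEC-2014", "29-DEC-2014", "29-DEC-2014", "29-DEC-2014")
ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088", "LU0168343191",
          "TW0002418001", "GB00B3FFY088", "LU0168343191", "TW0002418001", "GB00B3FFY088")
dt <- data.table(Date, ISIN, price = 1:9)

# Exact match on the identifier instead of a regular expression
wanted <- "LU0168343191"
one <- dt[ISIN %in% wanted]

# One data.table per ISIN, returned as a list named by ISIN
subsets <- split(dt, by = "ISIN")
subsets[["LU0168343191"]]
```

Here one holds the rows for a single identifier, while subsets holds all of them at once, which is closer to the list() the question describes.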

Answer 1
2 2016-10-13 16:02:15

Answer 2
0 Accepted 2016-10-13 15:49:04
