
How to filter large data sets by two attributes and split into subsets? R / grep

I have hit the limits of the grep() function, or perhaps there are more efficient ways of doing this.

Start with a sample data frame:

Date <- c( "31-DEC-2014","31-DEC-2014","31-DEC-2014","30-DEC-2014",
           "30-DEC-2014","30-DEC-2014", "29-DEC-2014","29-DEC-2014","29-DEC-2014" )

ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088","LU0168343191",
          "TW0002418001", "GB00B3FFY088","LU0168343191", "TW0002418001", "GB00B3FFY088")


price <- 1:9

# data.frame() keeps price numeric; cbind() would coerce every column to character
df <- data.frame(Date, ISIN, price, stringsAsFactors = FALSE)

The desired result is a list() containing subsets of the main data, each of which looks like the one below (three of them, one per individual identifier; Result_I shows one).

The idea is that the data should first be filtered by ISIN and then by Date. This two-step process should keep my data intact.

Result_d <- c("31-DEC-2014", "30-DEC-2014","29-DEC-2014")
Result_I <- c("LU0168343191","LU0168343191","LU0168343191")
Result_P <- c(1,4,7)

Result_df <- cbind(Result_d, Result_I, Result_P)

Please keep in mind that the above is for demo purposes. The real data set has 5M rows and 50 columns over a period of 450+ different dates (as per Result_d), so I am looking for something that works irrespective of nrow or ncol.

What I have so far:

I take all unique dates and store them:

Unique_Dates <- unique(df$Date)

The same for the identifiers:

Unique_ID <- unique(df$ISIN)

Now the grepping issue:

If I wanted all rows containing Unique_Dates I would do something like:

pattern <- paste(Unique_Dates, collapse = "|")

result <- as.matrix(df[grep(pattern, df$Date),])

and this retrieves basically the entire data set. I am wondering if anyone knows an efficient way of doing this.

Thanks in advance.

Using dplyr:

library(dplyr)

Date <- c( "31-Dec-2014","31-Dec-2014","31-Dec-2014","30-Dec-2014",
           "30-Dec-2014","30-Dec-2014", "29-Dec-2014","29-Dec-2014","29-Dec-2014" )

ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088","LU0168343191",
          "TW0002418001", "GB00B3FFY088","LU0168343191", "TW0002418001", "GB00B3FFY088")


price <- 1:9

DF <- data.frame(Date, ISIN, price, stringsAsFactors = FALSE)
# %b parses abbreviated month names according to the current locale
DF$Date <- as.Date(DF$Date, "%d-%b-%Y")



#Examine data ranges and frequencies

#date range
range(DF$Date)

#date frequency count
table(DF$Date)

#ISIN frequency count
table(DF$ISIN)


#select ISINs for filtering, with user defined choice of filters

# numISIN = 2
# subISIN = head(names(sort(table(DF$ISIN))),numISIN)


subISIN = names(sort(table(DF$ISIN)))[2]


# Keep only the chosen ISINs, then order the rows by ISIN and Date
subDF <- DF %>%
  dplyr::filter(ISIN %in% subISIN) %>%
  dplyr::arrange(ISIN, Date) %>%
  as.data.frame()

#> subDF
#        Date         ISIN price
#1 2014-12-29 LU0168343191     7
#2 2014-12-30 LU0168343191     4
#3 2014-12-31 LU0168343191     1
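
The question asks for a list() with one subset per ISIN rather than a single subset. A minimal base R sketch of that step follows, assuming the same sample data; the names DF_sorted and subsets are made up for illustration, and the dates are written in ISO format here only to avoid locale-dependent month parsing. It sorts once by ISIN and Date and then uses split(), so no regular expression is involved.

```r
# Rebuild the sample data (ISO dates are an assumption made for portability).
Date  <- rep(as.Date(c("2014-12-31", "2014-12-30", "2014-12-29")), each = 3)
ISIN  <- rep(c("LU0168343191", "TW0002418001", "GB00B3FFY088"), times = 3)
price <- 1:9
DF <- data.frame(Date, ISIN, price, stringsAsFactors = FALSE)

# Sort once by ISIN and Date, then split into one data frame per ISIN.
DF_sorted <- DF[order(DF$ISIN, DF$Date), ]
subsets   <- split(DF_sorted, DF_sorted$ISIN)

# Each list element is named after its ISIN.
subsets[["LU0168343191"]]
```

Because split() works on the grouping column alone, the same two lines apply whatever the number of rows or columns; whether it is fast enough for 5M rows would need to be measured on the real data.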

We convert the 'data.frame' to a 'data.table' ( setDT(df) ), group by 'Date', specify 'i' based on the index returned by grep, and subset the data.table ( .SD ) based on that 'i' index.

library(data.table)
setDT(df)[grep("LU", ISIN), .SD, by = Date]
#           Date         ISIN price
#1: 31-DEC-2014 LU0168343191     1
#2: 30-DEC-2014 LU0168343191     4
#3: 29-DEC-2014 LU0168343191     7
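
As a possible variation on the above, not part of the original answer: when the full ISIN is known, a keyed lookup matches it exactly instead of by pattern, and split() with by = "ISIN" returns the list of per-ISIN subsets the question asks for. The sketch below assumes a data.table version that provides the split method with a by argument; dt, lu and by_isin are illustrative names.

```r
library(data.table)

# Sample data from the question, built directly as a data.table.
dt <- data.table(
  Date  = rep(c("31-DEC-2014", "30-DEC-2014", "29-DEC-2014"), each = 3),
  ISIN  = rep(c("LU0168343191", "TW0002418001", "GB00B3FFY088"), times = 3),
  price = 1:9
)

# Keying by ISIN sorts the table and enables binary-search lookups.
setkey(dt, ISIN)

# Exact match on the key, no regular expression involved.
lu <- dt[.("LU0168343191")]

# One data.table per ISIN, returned as a named list.
by_isin <- split(dt, by = "ISIN")
```

The keyed lookup avoids scanning every value with a regular expression, which may matter at 5M rows, although this has not been benchmarked here.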
