Fastest way to find a big list of strings in a big data table in r

I have a list of around 15,000 user IDs,

> head(ID_data)
[1] "A01Y" "AC43" "BBN5" "JK45" "NT66" "WC44"

and a table with 3 columns and around 100,000 rows as a data.table:

> head(USER_data)
              V1                 V2             V3
1:             0               John           John
2:          A01Y     Martin 3311290
3: Peter Johnson         Peter JK45              x
4:             1     wc44@email.com wc44@email.com
5:            NA                  x
6:        419223 Christian 21221140 ac43@email.com

I want to know the row indices of the rows that contain a user ID somewhere in one of the 3 columns.

In the example above, the code should find rows 2, 3, 4 and 6, since they contain "A01Y", "JK45", "WC44" and "AC43" somewhere in one or more of the 3 columns.

The main problem is the large amount of data.

I have tried pasting "|" between the IDs and using grep to search for "A01Y|JK45" etc.:

toMatch <- paste(ID_data,collapse="|")
V1.matches <- grep(toMatch, USER_data$V1, ignore.case=TRUE)
V2.matches <- grep(toMatch, USER_data$V2, ignore.case=TRUE)
V3.matches <- grep(toMatch, USER_data$V3, ignore.case=TRUE)

but grep can only take a search pattern of around 2,500 IDs, so I would have to go through the IDs in blocks of 2,500. This takes around 15 minutes to compute.
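
In code, that chunked workaround looks roughly like this (a sketch, assuming the ID_data vector and USER_data table shown above):

# split the IDs into blocks of 2,500 and grep each block's pattern
# against all three columns, collecting the matching row indices
id_blocks <- split(ID_data, ceiling(seq_along(ID_data) / 2500))

row_matches <- sort(unique(unlist(lapply(id_blocks, function(block) {
  pattern <- paste(block, collapse = "|")
  which(grepl(pattern, USER_data$V1, ignore.case = TRUE) |
        grepl(pattern, USER_data$V2, ignore.case = TRUE) |
        grepl(pattern, USER_data$V3, ignore.case = TRUE))
}))))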

I have also tried using strapplyc, which can take a search pattern of around 9,999 IDs.

Is there a faster way to find the row indices?

I was thinking of using sqldf() and doing something like

sqldf("SELECT * FROM USER_data, ID_data WHERE USER_data LIKE '%'+ID_data+'%'")

but I'm not sure how to do this exactly.
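
For reference, one way such a query could be written (a sketch, assuming the sqldf package; SQLite concatenates strings with || rather than +, and ID_data has to be wrapped in a data frame first):

library(sqldf)

# sqldf queries data frames, so put the ID vector into one
ids_df <- data.frame(id = ID_data, stringsAsFactors = FALSE)

# rowid is SQLite's built-in row number for the imported table;
# a cross join with LIKE is easy to write but not necessarily fast
matches <- sqldf("
  SELECT DISTINCT u.rowid AS row_index
  FROM USER_data u
  JOIN ids_df i
    ON u.V1 LIKE '%' || i.id || '%'
    OR u.V2 LIKE '%' || i.id || '%'
    OR u.V3 LIKE '%' || i.id || '%'
")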

Thanks a lot in advance for any suggestions.

Not sure if this is fast enough, but I've done it before with many rows and IDs. It took me some time, but there was no need to process the IDs in blocks.

# list of ids
IDs = c("A01Y", "AC43", "BBN5", "JK45", "NT66", "WC44")

# example dataframe
dt =  data.frame(V1 = c("Christian 21223456","x", "wc44@email.com"),
                 V2 = c("0 John","1 wc44@email.com",  "wc44@email.com"),
                 V3 = c("1","0","A01Y Martin 3311290"))

dt

#                   V1               V2                  V3
# 1 Christian 21223456           0 John                   1
# 2                  x 1 wc44@email.com                   0
# 3     wc44@email.com   wc44@email.com A01Y Martin 3311290


# combine row elements in one big string
dt_rows = apply(dt, 1, function(x) paste(x,collapse = " "))

# update to lower case
IDs = tolower(IDs)
dt_rows = tolower(dt_rows)

# find in which row you have matches
sapply(IDs, grepl, dt_rows) 

#       a01y  ac43  bbn5  jk45  nt66  wc44
# [1,] FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE FALSE  TRUE
# [3,]  TRUE FALSE FALSE FALSE FALSE  TRUE


# find which row id has a match (at least one match)
which(apply(sapply(IDs, grepl, dt_rows), 1, sum) >= 1) 

# [1] 2 3
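
Applied to the question's data, the same idea could look like this (a sketch; the loop avoids building a 100,000 x 15,000 logical matrix, and fixed = TRUE treats each ID as a literal string rather than a regular expression):

# combine the three columns of USER_data into one string per row
user_rows <- tolower(apply(USER_data, 1, paste, collapse = " "))
ids_lower <- tolower(ID_data)

# accumulate a single logical vector of matched rows
hit <- rep(FALSE, length(user_rows))
for (id in ids_lower) {
  hit <- hit | grepl(id, user_rows, fixed = TRUE)
}

which(hit)
# for the head() shown in the question this gives 2 3 4 6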
