
Fuzzy and exact match of two databases

I have two databases. The first has about 70k rows and 3 columns; the second has 790k rows and 2 columns. Both databases have a common variable, grantee_name. I want to match each row of the first database to one or more rows of the second based on this grantee_name. Note that merge will not work, because the grantee_name values do not match perfectly: there are different spellings, etc. So I am using the fuzzyjoin package and trying the following:

library("haven"); library("fuzzyjoin"); library("dplyr")
forfuzzy <- read_dta("/path/forfuzzy.dta")
filings <- read_dta("/path/filings.dta")
> head(forfuzzy)
# A tibble: 6 x 3
  grantee_name                 grantee_city grantee_state
  <chr>                        <chr>        <chr>        
1 (ICS)2 MAINE CHAPTER         CLEARWATER   FL           
2 (SUFFOLK COUNTY) VANDERBILT~ CENTERPORT   NY           
3 1 VOICE TREKKING A FUND OF ~ WESTMINSTER  MD           
4 10 CAN                       NEWBERRY     FL           
5 10 THOUSAND WINDOWS          LIVERMORE    CA           
6 100 BLACK MEN IN CHICAGO INC CHICAGO      IL   
rows 7-70000 omitted for brevity

> head(filings)
# A tibble: 6 x 2
  grantee_name                       ein 
  <chr>                             <dbl>               
1 ICS-2 MAINE CHAPTER              123456             
2 SUFFOLK COUNTY VANDERBILT        654321            
3 VOICE TREKKING A FUND OF VOICES  789456            
4 10 CAN                           654987               
5 10 THOUSAND MUSKETEERS INC       789123               
6 100 BLACK MEN IN HOUSTON INC     987321      

rows 7-790000 omitted for brevity

The above examples are clear enough to show some good matches and some not-so-good matches. Note, for example, that 10 THOUSAND WINDOWS will match best with 10 THOUSAND MUSKETEERS INC, but that does not mean it is a good match. There will be a better match somewhere in the filings data (not shown above). That does not matter at this stage.
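To illustrate the idea, here is a toy check using base R's adist (plain Levenshtein edit distance, not the Jaro-Winkler distance I use below, but the intuition is the same: a smaller distance means a closer spelling, which is not necessarily a correct match):

```r
# Identical spellings give distance 0, so exact matches are always found.
adist("10 CAN", "10 CAN")

# Different organizations can still be each other's *closest* match:
# the distance is small relative to other candidates, but nonzero.
adist("10 THOUSAND WINDOWS", "10 THOUSAND MUSKETEERS INC")
```

This is why a low distance threshold (max_dist) only limits how dissimilar a pair can be; it cannot guarantee the pair refers to the same organization.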

So, I have tried the following:

df <- as.data.frame(stringdist_inner_join(forfuzzy, filings,
                                          by = "grantee_name", method = "jw",
                                          p = 0.1, max_dist = 0.1,
                                          distance_col = "distance"))

I am totally new to R. This results in an error: cannot allocate vector of size 375GB (with the big database, of course). A sample of 100 rows from forfuzzy always works. So I thought of iterating over 100 rows at a time.

I have tried the following:

n <- 100
lst <- split(forfuzzy, cumsum((1:nrow(forfuzzy) - 1) %% n == 0))

df <- lapply(lst, function(df_) {
  stringdist_inner_join(df_, filings, by = "grantee_name", method = "jw",
                        p = 0.1, max_dist = 0.1, distance_col = "distance",
                        nthread = getOption("sd_num_thread"))
}) %>% bind_rows()
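As a sanity check that the split expression chunks the data the way I intend, here is a toy example with made-up data (7 rows, chunks of 3):

```r
# cumsum((row_index - 1) %% n == 0) increments the group id every n rows,
# so split() yields data frames of n rows (the last one may be shorter).
toy <- data.frame(grantee_name = paste("org", 1:7))
n <- 3
chunks <- split(toy, cumsum((seq_len(nrow(toy)) - 1) %% n == 0))

length(chunks)        # 3 chunks
sapply(chunks, nrow)  # 3, 3, 1
```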

I have also tried the above with mclapply instead of lapply. The same error happens even though I tried a high-performance cluster setting with 3 CPUs, each with 480G of memory, using mclapply with the option mc.cores = 3. Perhaps a foreach loop could help, but I have no idea how to implement it.
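A foreach version might look something like the sketch below. This is only a sketch, assuming the foreach and doParallel packages are installed, and reusing the lst and filings objects from above; I have not verified that it avoids the memory error, since each worker still holds a full copy of filings:

```r
library(foreach)
library(doParallel)
library(fuzzyjoin)

# Register 3 parallel workers (one per CPU on the cluster node).
cl <- makeCluster(3)
registerDoParallel(cl)

# Fuzzy-join each chunk against filings on its own worker,
# then row-bind the per-chunk results into one data frame.
df <- foreach(chunk = lst, .combine = dplyr::bind_rows,
              .packages = c("fuzzyjoin", "dplyr")) %dopar% {
  stringdist_inner_join(chunk, filings, by = "grantee_name",
                        method = "jw", p = 0.1, max_dist = 0.1,
                        distance_col = "distance")
}

stopCluster(cl)
```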

I have been advised to use the purrr and repurrrsive packages, so I tried the following:

purrr::map(lst, ~ stringdist_inner_join(., filings, by = "grantee_name",
                                        method = "jw", p = 0.1, max_dist = 0.1,
                                        distance_col = "distance",
                                        nthread = getOption("sd_num_thread")))

This seems to be working, after a novice error in the by = "grantee_name" argument. However, it is taking forever and I am not sure it will finish. A sample of 100 rows from forfuzzy, split with n = 10 (so 10 lists of 10 rows each), has been running for 50 minutes with still no results.

I haven't used foreach before, but maybe the variable x is already the individual rows of zz1?

Have you tried:

stringdist_inner_join(x, zz2, by = "grantee_name", method = "jw",
                      p = 0.1, max_dist = 0.1, distance_col = "distance")

?

If you split your uniquegrantees data frame (with base::split or dplyr::group_split) into a list of data frames, then you can call purrr::map on the list (map is pretty much lapply):

purrr::map(list_of_dfs, ~ stringdist_inner_join(., filings, by = "grantee_name",
                                                method = "jw", p = 0.1,
                                                max_dist = 0.1,
                                                distance_col = "distance"))

Your result will be a list of data frames, each fuzzy-joined with filings. You can then call bind_rows (or use map_dfr) to get all the results back in a single data frame.
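The combine step can be written either way; a sketch reusing the parameters from the question (map_dfr joins and row-binds in one call):

```r
library(purrr)
library(dplyr)
library(fuzzyjoin)

# Two-step version: map produces a list of joined data frames,
# bind_rows stacks them into one.
results <- map(list_of_dfs,
               ~ stringdist_inner_join(.x, filings, by = "grantee_name",
                                       method = "jw", p = 0.1, max_dist = 0.1,
                                       distance_col = "distance"))
df <- bind_rows(results)

# One-step equivalent: map_dfr row-binds as it goes.
df <- map_dfr(list_of_dfs,
              ~ stringdist_inner_join(.x, filings, by = "grantee_name",
                                      method = "jw", p = 0.1, max_dist = 0.1,
                                      distance_col = "distance"))
```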

See R - Splitting a large dataframe into several smaller dataframes, performing fuzzyjoin on each one and outputting to a single dataframe.
