Removing Fuzzy Duplicates in R

I have a dataset that looks something like this in R:

address = c("882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789")

name = c("ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25")

my_data = data.frame(address, name)

                            address                name
1    882 4N Road River NY, NY 12345 ABC Center Building
2    882 - River Road NY, ZIP 12345      Cent. Bldg ABC
3 123 Fake Road Boston Drive Boston      BD Home 25 New
4        123 Fake - Rd Boston 56789  Boarding Direct 25

Looking at this data, it is clear that the first two rows refer to the same place, as do the last two rows. However, if you tried to remove duplicates directly, standard functions (e.g. distinct()) would report that there are no duplicates in this dataset, since every row contains some unique element.
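
For example, a quick check with dplyr (a minimal sketch using the data frame above) returns all four rows unchanged:

library(dplyr)

distinct(my_data)   # keeps all four rows, because no two rows are exactly identical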

I have been trying to research different methods in R that are able to de-duplicate rows based on "fuzzy conditions".

Based on the answers provided here (Techniques for finding near duplicate records), I came across a method called "Record Linkage". I also found this tutorial (https://cran.r-project.org/web/packages/RecordLinkage/vignettes/WeightBased.pdf) that might be able to perform a similar task, but I am not sure if it is intended for the problem I am working on.

  • Can someone please help me confirm if this Record Linkage tutorial is in fact relevant to the problem I am working on - and if so, could someone please show me how to use it?

  • For example, I would like to remove duplicates based on the name and address, and keep only two rows (i.e. one row from row1/row2 and one row from row3/row4; whichever one is kept doesn't really matter).

  • As another example - suppose I wanted to try this and only de-duplicate based on the "address" column: is this also possible?

Can someone please show me how this could work?

Thank you!

Note: I have heard of some options that use SQL joins along with fuzzy joins (e.g. https://cran.r-project.org/web/packages/fuzzyjoin/readme/README.html) - but I am not sure if this option is also suitable.

For tasks like this, I like to use a divide-and-conquer strategy, as you quickly run into memory issues when comparing a larger number of strings or longer strings.

packages

library(tidyverse)
library(quanteda)
library(quanteda.textstats)
library(stringdist)

phase 1: token similarity

I add an ID column and combine name and address into fulltext for comparison.

my_data2 <- my_data |>
  mutate(ID = factor(row_number()),
         fulltext = paste(name, address))

The quanteda approach to similarity is to divide strings into words/tokens before comparing which tokens are the same in two strings. This is extremely efficient compared to string distance:

duplicates <- my_data2 |> 
  # a bunch of wrangling to create the quanteda dfm object
  corpus(docid_field = "ID",
         text_field = "fulltext") |> 
  tokens() |> 
  dfm() |> 
  # calculate similarity using cosine (other methods are available)
  textstat_simil(method = "cosine") |> 
  as_tibble() |>
  # attaching the original documents back to the output 
  left_join(my_data2, by = c("document1" = "ID")) |> 
  left_join(my_data2, by = c("document2" = "ID"), suffix = c("", "_comparison"))

duplicates |> 
  select(cosine, 
         address, address_comparison, 
         name, name_comparison)
#> # A tibble: 5 × 5
#>   cosine address                           address_comparison      name  name_…¹
#>    <dbl> <chr>                             <chr>                   <chr> <chr>  
#> 1 0.641  882 4N Road River NY, NY 12345    882 - River Road NY, Z… ABC … Cent. …
#> 2 0.0801 882 4N Road River NY, NY 12345    123 Fake Road Boston D… ABC … BD Hom…
#> 3 0.0833 882 - River Road NY, ZIP 12345    123 Fake Road Boston D… Cent… BD Hom…
#> 4 0.0962 882 - River Road NY, ZIP 12345    123 Fake - Rd Boston 5… Cent… Boardi…
#> 5 0.481  123 Fake Road Boston Drive Boston 123 Fake - Rd Boston 5… BD H… Boardi…
#> # … with abbreviated variable name ¹​name_comparison

As you can see, the first and second, as well as the third and fourth entries have a rather high similarity with 0.641 and 0.481 respectively. This comparison can already be enough to identify duplicates in most cases. However, it completely ignores word order. The classic example is that "Dog bites man" and "Man bites dog" have a token similarity of 100%, yet an entirely different meaning. Look into your dataset to figure out if the order of tokens plays a role or not. If you think it does, read on.
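
To make that concrete, here is a small sketch with the same packages (the toy sentences are just for illustration): the bag-of-words view cannot tell the two strings apart, while a character-level measure can.

toy <- c(d1 = "Dog bites man", d2 = "Man bites dog")

toy |>
  corpus() |>
  tokens() |>
  dfm() |>
  textstat_simil(method = "cosine")
# cosine similarity of 1: the token counts are identical, word order is ignored

stringsim("Dog bites man", "Man bites dog", method = "lv")
# roughly 0.54: the character-level comparison does respect order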

phase 2: string similarity

String similarity as implemented in stringdist is a normalised version of the distance. For a plain distance, the length of the texts you compare plays no role: two 4-letter strings that differ in two letters are very dissimilar, while the same two-letter difference between two 100-letter strings barely matters. Your example looks like this might not be a big issue, but in general I prefer similarity for that reason.
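
A minimal sketch of that point (the toy strings are made up for illustration):

stringdist("abcd", "abxy", method = "lv")   # distance 2 ...
stringsim("abcd", "abxy", method = "lv")    # ... which is only 0.5 similarity on 4 letters

long <- strrep("a", 98)
stringdist(paste0(long, "cd"), paste0(long, "xy"), method = "lv")   # still distance 2 ...
stringsim(paste0(long, "cd"), paste0(long, "xy"), method = "lv")    # ... but 0.98 similarity on 100 letters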

The problem with string similarity and distance, however, is that they are computationally very costly. Even a couple of hundred short texts can quickly take up your entire memory. So what you can do is filter the results above and only calculate string similarity on the candidates that already look like duplicates:

duplicates_stringsim <- duplicates |> 
  filter(cosine > 0.4) |> 
  mutate(stringsim = stringsim(fulltext, fulltext_comparison, method = "lv"))

duplicates_stringsim |> 
  select(cosine, stringsim,
         address, address_comparison, 
         name, name_comparison)
#> # A tibble: 2 × 6
#>   cosine stringsim address                           address_com…¹ name  name_…²
#>    <dbl>     <dbl> <chr>                             <chr>         <chr> <chr>  
#> 1  0.641     0.48  882 4N Road River NY, NY 12345    882 - River … ABC … Cent. …
#> 2  0.481     0.354 123 Fake Road Boston Drive Boston 123 Fake - R… BD H… Boardi…
#> # … with abbreviated variable names ¹​address_comparison, ²​name_comparison

For comparison, the stringsim values for the other three comparisons that we have already eliminated are 0.2, 0.208 and 0.133. Even though they are a little smaller, the string similarities confirm the results from phase 1.

Now the final step is to remove the duplicates from the original data.frame. For this I use another filter, pull out the IDs from the duplicates_stringsim object and then remove these duplicates from the data.

dup_ids <- duplicates_stringsim |> 
  filter(stringsim > 0.3) |> 
  pull(document2)


my_data2 |> 
  filter(!ID %in% dup_ids)
#>                             address                name ID
#> 1    882 4N Road River NY, NY 12345 ABC Center Building  1
#> 2 123 Fake Road Boston Drive Boston      BD Home 25 New  3
#>                                             fulltext
#> 1 ABC Center Building 882 4N Road River NY, NY 12345
#> 2   BD Home 25 New 123 Fake Road Boston Drive Boston

Created on 2022-11-16 with reprex v2.0.2

Note that I chose the cutoff values based on your requirements for the example. You will have to fine-tune these for your dataset and likely for every new project.
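
If you want to experiment with different cutoffs, one option is to wrap the two phases into a single function. The sketch below does that; the name dedupe_fuzzy and its defaults are purely illustrative, and it assumes your columns are called name and address:

dedupe_fuzzy <- function(data, cosine_cutoff = 0.4, stringsim_cutoff = 0.3) {
  data2 <- data |>
    dplyr::mutate(ID = factor(dplyr::row_number()),
                  fulltext = paste(name, address))

  dup_ids <- data2 |>
    # phase 1: token similarity on the combined text
    quanteda::corpus(docid_field = "ID", text_field = "fulltext") |>
    quanteda::tokens() |>
    quanteda::dfm() |>
    quanteda.textstats::textstat_simil(method = "cosine") |>
    tibble::as_tibble() |>
    dplyr::left_join(data2, by = c("document1" = "ID")) |>
    dplyr::left_join(data2, by = c("document2" = "ID"),
                     suffix = c("", "_comparison")) |>
    dplyr::filter(cosine > cosine_cutoff) |>
    # phase 2: string similarity only on the remaining candidates
    dplyr::mutate(stringsim = stringdist::stringsim(fulltext, fulltext_comparison,
                                                    method = "lv")) |>
    dplyr::filter(stringsim > stringsim_cutoff) |>
    dplyr::pull(document2)

  data2 |>
    dplyr::filter(!ID %in% dup_ids) |>
    dplyr::select(-ID, -fulltext)
}

dedupe_fuzzy(my_data, cosine_cutoff = 0.4, stringsim_cutoff = 0.3)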

stringdist::stringdist() can be useful for finding near-duplicates, at least in relatively simple cases.

With your example data, we can perform a cartesian self-join to get all combinations of rows; use stringdist::stringdist() to compute distances* for all row-pairs for address and name; and arrange with the most similar row-pairs first:

library(dplyr)
library(tidyr)
library(stringdist)

my_data_dists <- my_data %>% 
  mutate(row = row_number()) %>% 
  full_join(., ., by = character()) %>% 
  filter(row.x < row.y) %>% 
  mutate(
    address.dist = stringdist(address.x, address.y),
    name.dist = stringdist(name.x, name.y)
  ) %>% 
  arrange(scale(address.dist) + scale(name.dist)) %>% 
  relocate(
    row.x, row.y,
    address.dist, name.dist,
    address.x, address.y, 
    name.x, name.y
  )

my_data_dists
  row.x row.y address.dist name.dist                         address.x                         address.y              name.x             name.y
1     1     2           13        13    882 4N Road River NY, NY 12345    882 - River Road NY, ZIP 12345 ABC Center Building     Cent. Bldg ABC
2     3     4           15        16 123 Fake Road Boston Drive Boston        123 Fake - Rd Boston 56789      BD Home 25 New Boarding Direct 25
3     2     3           25        13    882 - River Road NY, ZIP 12345 123 Fake Road Boston Drive Boston      Cent. Bldg ABC     BD Home 25 New
4     1     3           25        15    882 4N Road River NY, NY 12345 123 Fake Road Boston Drive Boston ABC Center Building     BD Home 25 New
5     2     4           23        17    882 - River Road NY, ZIP 12345        123 Fake - Rd Boston 56789      Cent. Bldg ABC Boarding Direct 25
6     1     4           25        18    882 4N Road River NY, NY 12345        123 Fake - Rd Boston 56789 ABC Center Building Boarding Direct 25

From here, you can manually weed out duplicates, or eyeball the results to choose a distance threshold for considering rows "duplicates." If we take the latter approach, it looks like name.dist may not be a reliable metric (e.g. one of the lowest values is a false positive), but address.dist scores below 20 seem reliable. You can then apply this to filter your original data.

dupes <- my_data_dists$row.y[my_data_dists$address.dist < 20]

my_data[-dupes,]
                            address                name
1    882 4N Road River NY, NY 12345 ABC Center Building
3 123 Fake Road Boston Drive Boston      BD Home 25 New

For more complex cases (e.g. more columns, very large datasets), you're likely better off with RecordLinkage or some of the other suggestions in the comments. But I've found stringdist flexible and helpful for cases involving just a few columns.
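
As a rough sketch of what that RecordLinkage route might look like on this data (I haven't tuned this; the 0.7 threshold is purely illustrative, so check the package's weight-based vignette for details):

library(RecordLinkage)

# compare every pair of rows, using string comparators rather than exact equality
pairs <- compare.dedup(my_data, strcmp = TRUE)

# weight the pairs and classify them as links (likely duplicates) or non-links
pairs <- epiWeights(pairs)
result <- epiClassify(pairs, threshold.upper = 0.7)

# drop rows that were classified as duplicates of an earlier row
dup_rows <- unique(result$pairs$id2[result$prediction == "L"])
my_data[-dup_rows, ]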

Edit: An alternative interface is provided by stringdist::stringdistmatrix() or utils::adist(), which return a dist object or a matrix of distances among the elements of one or two vectors:

stringdistmatrix(my_data$name)
#    1  2  3
# 2 13      
# 3 15 13   
# 4 18 17 16

adist(my_data$name)
#      [,1] [,2] [,3] [,4]
# [1,]    0   13   15   18
# [2,]   13    0   13   17
# [3,]   15   13    0   16
# [4,]   18   17   16    0

Edit 2: I've added some more information in response to OP's questions in a gist.


* stringdist functions use optimal string alignment by default, but other metrics can be specified in the method argument.

Using agrep with a setting of max.distance = list(all = 0.6) (60%) gives good results both when using address and name together and when using only address. Results may vary slightly on larger data sets.

agrep
Approximate String Matching (Fuzzy Matching)

  • max.distance: Maximum distance allowed for a match.
    'all': maximal number/fraction of all transformations (insertions, deletions and substitutions)

Filter uniques, keeping the first entry (this can be adjusted to keep the longest entry instead; see the sketch after the examples below).

my_data[unique(sapply(paste(my_data$address, my_data$name), function(x)
  agrep(x, paste(my_data$address, my_data$name), 
    max.distance = list(all = 0.6))[1])),]
                            address                name
1    882 4N Road River NY, NY 12345 ABC Center Building
3 123 Fake Road Boston Drive Boston      BD Home 25 New

Using only address

my_data[unique(sapply(my_data$address, function(x) 
  agrep(x, my_data$address, max.distance = list(all = 0.6))[1])),]
                            address                name
1    882 4N Road River NY, NY 12345 ABC Center Building
3 123 Fake Road Boston Drive Boston      BD Home 25 New
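
As mentioned above, here is a hypothetical variant that keeps the longest entry in each fuzzy group instead of the first one (the grouping logic is my own sketch, not something agrep provides directly):

combined <- paste(my_data$address, my_data$name)

# the index of the first fuzzy match acts as a group label for each row
grp <- sapply(combined, function(x)
  agrep(x, combined, max.distance = list(all = 0.6))[1])

# within each group, keep the row with the longest combined string
keep <- tapply(seq_along(combined), grp, function(i) i[which.max(nchar(combined[i]))])

my_data[sort(unlist(keep)), ]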
