How to compare two strings word by word in R

I have a dataset, let's call it "ORIGINALE", made up of many rows and only two columns, the first called "DESCRIPTION" and the second "CODICE". The DESCRIPTION column has the right information, while the CODICE column, which is the key, is almost always empty. I'm therefore trying to look up the corresponding CODICE in another dataset, let's call it "REFERENCE". I am using the DESCRIPTION column, which is in natural language, and trying to match it with the description in the second dataset. I have to match word by word, since there may be a different word order, synonyms or abbreviations. Then I calculate a similarity score, keep only the best match, and accept the matches above a certain score. Is there a way to improve it? I'm working with around 300000 rows and, even though I know it is always going to take time, perhaps there is a way to make it even slightly faster.

library(stringdist) # stringsim()
library(tibble)     # tibble()

ORIGINALE <- data.frame(DESCRIPTION = c("mr peter 123 rose street 3b LA"," 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE <- data.frame(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67","aw56", "83776250"))

# for each row of x, find the most similar DESCRIPTION in y
algoritmo <- function(x, y) {
  split1 <- strsplit(x$DESCRIPTION, " ")
  split2 <- strsplit(y$DESCRIPTION, " ")
  risultato <- vector()
  distanza <- vector()
  for (i in 1:NROW(split1)) {
    best_dist <- -5
    closest_match <- -5
    for (j in 1:NROW(split2)) {
      dist <- stringsim(as.character(split1[i]), as.character(split2[j]))
      if (dist > best_dist) {
        closest_match <- y$DESCRIPTION[j]
        best_dist <- dist
      }
    }
    distanza <- append(distanza, best_dist)
    risultato <- append(risultato, closest_match)
  }
  # store the result in the global environment
  confronto <<- tibble(x$DESCRIPTION, risultato, distanza)
}

algoritmo(ORIGINALE, REFERENCE)

# accept matches with a similarity of at least 0.6 (numeric, not the string "0.6")
match <- subset.data.frame(confronto, confronto$distanza >= 0.6)
missing <- subset.data.frame(confronto, confronto$distanza < 0.6)

Good question. for loops are slow in R:

for(i in 1:NROW(split1)) {
for(j in 1:NROW(split2)) {

For fast R, you need to vectorize your algorithm. I'm not that handy with data.frame anymore, so I'll use data.table, which extends it.

library(data.table)
ORIGINALE = data.table(DESCRIPTION = c("mr peter 123 rose street 3b LA"," 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE = data.table(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67","aw56", "83776250"))

# split DESCRIPTION to make tables that have one word per row
ORIGINALE_WORDS = ORIGINALE[,.(word=unlist(strsplit(DESCRIPTION,' ',fixed=T))),.(DESCRIPTION,CODICE)]
REFERENCE_WORDS = REFERENCE[,.(word=unlist(strsplit(DESCRIPTION,' ',fixed=T))),.(DESCRIPTION,CODICE)]

# remove empty words introduced by extra spaces in your DESCRIPTIONS
ORIGINALE_WORDS = ORIGINALE_WORDS[word!='']
REFERENCE_WORDS = REFERENCE_WORDS[word!='']

# merge the tables by word
merged = merge(ORIGINALE_WORDS,REFERENCE_WORDS,by='word',all=F,allow.cartesian=T)

# count matching words for each combination of ORIGINALE DESCRIPTION and REFERENCE DESCRIPTION and CODICE
counts = merged[,.N,.(DESCRIPTION.x,DESCRIPTION.y,CODICE.y)]

# keep only the CODICE.y with the highest N for each DESCRIPTION.x
# (this must use counts, which holds the N column; merged has no N)
topcounts = counts[order(-N)][!duplicated(DESCRIPTION.x)]

# merge the counts back to ORIGINALE
result = merge(ORIGINALE,topcounts,by.x='DESCRIPTION',by.y='DESCRIPTION.x',all.x=T,all.y=F)

Here is the result:

                            DESCRIPTION CODICE                           DESCRIPTION.y CODICE.y N
1:     4c flower str jenny jane Chicago     NA     jenny jane flower street 4c Chicago 83776250 5
2:       mr peter 123 rose street 3b LA     NA       peter green 123 rose street 3b LA     aw56 6
3: washington miss sarah 430f name strt     NA sarah brown name street 430f washington  135tg67 4

PS: There are more memory-efficient ways to do this. This code could crash your machine with an out-of-memory error, or run slowly if it has to fall back on virtual memory, but if neither happens it should be faster than the for loops.
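One way to bound memory, sketched here in base R rather than data.table, is to match ORIGINALE in chunks so that only one chunk's word join exists at a time. The helpers `to_words` and `match_chunk` and the value of `chunk_size` are hypothetical names chosen for illustration; they are not part of the answer above, and the chunk size would need tuning on real data.

```r
# Same example data as above, as plain character columns.
ORIGINALE <- data.frame(DESCRIPTION = c("mr peter 123 rose street 3b LA",
                                        " 4c flower str jenny jane Chicago",
                                        "washington miss sarah 430f name strt"),
                        stringsAsFactors = FALSE)
REFERENCE <- data.frame(DESCRIPTION = c("sarah brown name street 430f washington",
                                        "peter green 123 rose street 3b LA",
                                        "jenny jane flower street 4c Chicago"),
                        CODICE = c("135tg67", "aw56", "83776250"),
                        stringsAsFactors = FALSE)

# Hypothetical helper: one row per (description, word), empty words dropped.
to_words <- function(desc) {
  w <- strsplit(desc, " ", fixed = TRUE)
  out <- data.frame(DESCRIPTION = rep(desc, lengths(w)), word = unlist(w),
                    stringsAsFactors = FALSE)
  out[out$word != "", ]
}

# The reference words are split once and reused for every chunk.
ref_words <- to_words(REFERENCE$DESCRIPTION)

# Hypothetical helper: best reference description for each description in one chunk.
match_chunk <- function(desc) {
  m <- merge(to_words(desc), ref_words, by = "word")
  if (nrow(m) == 0) return(NULL)  # no shared words in this chunk
  counts <- aggregate(word ~ DESCRIPTION.x + DESCRIPTION.y, data = m, FUN = length)
  names(counts)[3] <- "N"
  counts <- counts[order(-counts$N), ]
  counts[!duplicated(counts$DESCRIPTION.x), ]
}

chunk_size <- 2  # tiny for the demo; assumed to be much larger in practice
idx <- ceiling(seq_along(ORIGINALE$DESCRIPTION) / chunk_size)
result <- do.call(rbind, lapply(split(ORIGINALE$DESCRIPTION, idx), match_chunk))
result
```

Each chunk only joins its own words against `ref_words`, so peak memory depends on `chunk_size` rather than on the full 300000 rows. Descriptions that share no word with any reference simply do not appear in `result`.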

What about:

library(stringdist)
library(dplyr)
library(tidyr) 

data_o <- ORIGINALE %>% mutate(desc_o = DESCRIPTION) %>% select(desc_o)
data_r <- REFERENCE %>% mutate(desc_r = DESCRIPTION) %>% select(desc_r)
data <- crossing(data_o,data_r)
data %>% mutate(dist= stringsim(as.character(desc_o),as.character(desc_r))) %>%
         group_by(desc_o) %>% 
         filter(dist==max(dist))

  desc_o                                 desc_r                                   dist
  <chr>                                  <chr>                                   <dbl>
1 " 4c flower str jenny jane Chicago"    jenny jane flower street 4c Chicago     0.486
2 "mr peter 123 rose street 3b LA"       peter green 123 rose street 3b LA       0.758
3 "washington miss sarah 430f name strt" sarah brown name street 430f washington 0.385
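Whole-string similarity is sensitive to word order, which is probably part of why rows 1 and 3 score low here. A minimal sketch of one workaround, assuming it is acceptable to sort the words of each description before comparing: it uses base R `adist` (generalized Levenshtein distance), which is a different measure from the `stringsim` default, so the scores are not comparable with the table above. `normalize` is a hypothetical helper name.

```r
desc_o <- c("mr peter 123 rose street 3b LA",
            " 4c flower str jenny jane Chicago",
            "washington miss sarah 430f name strt")
desc_r <- c("sarah brown name street 430f washington",
            "peter green 123 rose street 3b LA",
            "jenny jane flower street 4c Chicago")

# Hypothetical helper: lower-case, trim, split on whitespace, sort the words.
normalize <- function(s) {
  vapply(strsplit(tolower(trimws(s)), "[[:space:]]+"),
         function(w) paste(sort(w), collapse = " "),
         character(1))
}

a <- normalize(desc_o)
b <- normalize(desc_r)

# Edit distance for every pair: rows = desc_o, columns = desc_r.
d <- adist(a, b)

# Scale to a 0..1 similarity by the length of the longer string in each pair.
sim <- 1 - d / outer(nchar(a), nchar(b), pmax)

# Index of the most similar reference for each original description.
best <- max.col(sim, ties.method = "first")
data.frame(desc_o, desc_r = desc_r[best],
           sim = sim[cbind(seq_along(best), best)])
```

Sorting removes the order effect but not abbreviations such as "str" or "strt", so a threshold like the 0.6 in the question would still need to be checked against real data.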

The R tm (text mining) library can help here:

library(tm)
library(proxy) # for computing cosine similarity
library(data.table)

ORIGINALE = data.table(DESCRIPTION = c("mr peter 123 rose street 3b LA"," 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE = data.table(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67","aw56", "83776250"))

# combine ORIGINALE and REFERENCE into one data.table
both = rbind(ORIGINALE,REFERENCE)

# create "doc_id" and "text" columns (required by tm)
both[,doc_id:=1:.N]
names(both)[1] = 'text'

# convert to tm corpus
corpus = SimpleCorpus(DataframeSource(both))

# convert to a tm document term matrix
dtm = DocumentTermMatrix(corpus)

# convert to a regular matrix
dtm = as.matrix(dtm)

# look at it (t() transpose for readability)
t(dtm)
            Docs
Terms        1 2 3 4 5 6
  123        1 0 0 0 1 0
  peter      1 0 0 0 1 0
  rose       1 0 0 0 1 0
  street     1 0 0 1 1 1
  chicago    0 1 0 0 0 1
  flower     0 1 0 0 0 1
  jane       0 1 0 0 0 1
  jenny      0 1 0 0 0 1
  str        0 1 0 0 0 0
  430f       0 0 1 1 0 0
  miss       0 0 1 0 0 0
  name       0 0 1 1 0 0
  sarah      0 0 1 1 0 0
  strt       0 0 1 0 0 0
  washington 0 0 1 1 0 0
  brown      0 0 0 1 0 0
  green      0 0 0 0 1 0

# compute similarity between each combination of documents 1:3 and documents 4:6
similarity = proxy::dist(dtm[1:3,], dtm[4:6,], method="cosine")

# result:
ORIGINALE               REFERENCE document
 document              4         5         6
        1      0.7958759 0.1055728 0.7763932   <-- difference (smaller = more similar)
        2      1.0000000 1.0000000 0.2000000
        3      0.3333333 1.0000000 1.0000000

# make a table of which REFERENCE document is most similar
most_similar = rbindlist(
  apply(
    similarity,1,function(x){
      data.table(i=which.min(x),distance=min(x))
    }
  )
)

# result:
   i  distance
1: 2 0.1055728
2: 3 0.2000000
3: 1 0.3333333
# rows 1, 2, 3 are rows of ORIGINALE; i: 2 3 1 are rows of REFERENCE

# add the results back to ORIGINALE
ORIGINALE1 = cbind(ORIGINALE,most_similar)
REFERENCE[,i:=1:.N]
ORIGINALE2 = merge(ORIGINALE1,REFERENCE,by='i',all.x=T,all.y=F)

# result:
   i                        DESCRIPTION.x CODICE.x  distance                           DESCRIPTION.y CODICE.y
1: 1 washington miss sarah 430f name strt       NA 0.3333333 sarah brown name street 430f washington  135tg67
2: 2       mr peter 123 rose street 3b LA       NA 0.1055728       peter green 123 rose street 3b LA     aw56
3: 3     4c flower str jenny jane Chicago       NA 0.2000000     jenny jane flower street 4c Chicago 83776250

# now the documents in ORIGINALE2 are in a different order than in ORIGINALE.
# this is caused by merging by i (=REFERENCE document row).
# if order is important, then add these two lines around the merge line:
ORIGINALE1[,ORIGINALE_i:=1:.N]
ORIGINALE2 = merge(...
ORIGINALE2 = ORIGINALE2[order(ORIGINALE_i)]
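To make the cosine distances above less of a black box, here is a minimal base R sketch that rebuilds them without tm or proxy. It assumes a simple tokenizer (lower-case, split on whitespace, keep words of at least 3 characters), chosen to mirror the terms shown in the matrix above; it is not tm's actual tokenizer, and `tokenize` is a hypothetical helper name.

```r
docs_o <- c("mr peter 123 rose street 3b LA",
            " 4c flower str jenny jane Chicago",
            "washington miss sarah 430f name strt")
docs_r <- c("sarah brown name street 430f washington",
            "peter green 123 rose street 3b LA",
            "jenny jane flower street 4c Chicago")

# Hypothetical tokenizer: lower-case words of 3+ characters.
tokenize <- function(s) {
  w <- strsplit(tolower(trimws(s)), "[[:space:]]+")[[1]]
  w[nchar(w) >= 3]
}

tok <- lapply(c(docs_o, docs_r), tokenize)
vocab <- sort(unique(unlist(tok)))

# Document-term count matrix: one row per document, one column per term.
dtm <- t(vapply(tok,
                function(w) as.numeric(table(factor(w, levels = vocab))),
                numeric(length(vocab))))

# Scale each row to unit length; cosine similarity is then a dot product.
unit <- dtm / sqrt(rowSums(dtm^2))

# Cosine distance between documents 1:3 (rows) and documents 4:6 (columns).
cos_dist <- 1 - unit[1:3, ] %*% t(unit[4:6, ])
round(cos_dist, 7)
```

For example, documents 1 and 5 share 4 terms and have 4 and 5 terms respectively, so their distance is 1 - 4 / (sqrt(4) * sqrt(5)), which matches the 0.1055728 reported above.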
