R function to correct words using the most frequent similar word

Question

I have a table with misspelled words. I need to correct each one by replacing it with the most similar word that has the highest frequency.

For example, after I run

aggregate(CustomerID ~ Province, ventas2, length)

I get

 1
 2            AMBA      29
 3          BAIRES       1
 4     BENOS AIRES       1

12    BUENAS AIRES       1

17   BUENOS  AIRES       4
18    buenos aires       7
19    Buenos Aires       3
20    BUENOS AIRES   11337
35         CORDOBA    2297
36         cordoba       1
38       CORDOBESA       1
39      CORRIENTES     424

So I need to replace buenos aires, Buenos Aires, BAIRES and the other variants with BUENOS AIRES, but AMBA shouldn't be replaced. Likewise, CORDOBESA and cordoba should be replaced by CORDOBA, but CORRIENTES should not.

How can I do this in R?

Thanks!

Answer 1

Here's a possible solution.

Disclaimer:
This code seems to work fine with your current example. I can't guarantee that the current parameters (e.g. cut height, cluster agglomeration method, distance method) will be valid for your real (complete) data.

# recreating your data
data <- 
read.csv(text=
'City,Occurr
AMBA,29
BAIRES,1
BENOS AIRES,1
BUENAS AIRES,1
BUENOS  AIRES,4
buenos aires,7
Buenos Aires,3
BUENOS AIRES,11337
CORDOBA,2297
cordoba,1
CORDOBESA,1
CORRIENTES,424',stringsAsFactors=FALSE)


# simple pre-processing to city strings:
# - removing spaces
# - turning strings to uppercase
cities <- gsub('\\s+','',toupper(data$City))

# string distance computation
# N.B. here you can play with single components of distance costs 
d <- adist(cities, costs=list(insertions=1, deletions=1, substitutions=1))
# assign original cities names to distance matrix
rownames(d) <- data$City
# clustering cities
hc <- hclust(as.dist(d),method='single')

# plot the cluster dendrogram
plot(hc)
# add the cluster rectangles (just to see the clusters) 
# N.B. I decided to cut at distance height < 5
#      (read it as: "I consider equal 2 strings needing
#       less than 5 modifications to pass from one to the other")
#      Obviously you can use another value.
rect.hclust(hc,h=4.9)

# get the clusters ids
clusters <- cutree(hc,h=4.9) 
# turn into data.frame
clusters <- data.frame(City=names(clusters),ClusterId=clusters)

# merge with frequencies
merged <- merge(data,clusters,all.x=TRUE,by='City') 

# add CityCorrected column to the merged data.frame
ret <- by(merged, 
          merged$ClusterId,
          FUN=function(grp){
                # pick the row with the highest frequency in this cluster
                idx <- which.max(grp$Occurr)
                grp$CityCorrected <- grp[idx,'City']
                return(grp)
              })

fixed <- do.call(rbind,ret)

Result:

> fixed
              City Occurr ClusterId CityCorrected
1             AMBA     29         1          AMBA
2.2         BAIRES      1         2  BUENOS AIRES
2.3    BENOS AIRES      1         2  BUENOS AIRES
2.4   BUENAS AIRES      1         2  BUENOS AIRES
2.5  BUENOS  AIRES      4         2  BUENOS AIRES
2.6   buenos aires      7         2  BUENOS AIRES
2.7   Buenos Aires      3         2  BUENOS AIRES
2.8   BUENOS AIRES  11337         2  BUENOS AIRES
3.9        cordoba      1         3       CORDOBA
3.10       CORDOBA   2297         3       CORDOBA
3.11     CORDOBESA      1         3       CORDOBA
4       CORRIENTES    424         4    CORRIENTES
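
The table above is a lookup from each spelling to its corrected form, so one way to use it is to map it back onto the original column. The following is only a sketch: the `ventas2` data frame here is a hypothetical stand-in with made-up rows, `fixed` is rebuilt by hand with a few rows, and the `ProvinceCorrected` column name is an assumption.

```r
# Lookup table in the shape of `fixed` above (a few rows only, for illustration)
fixed <- data.frame(
  City          = c("AMBA", "BAIRES", "buenos aires", "BUENOS AIRES", "cordoba", "CORDOBA"),
  CityCorrected = c("AMBA", "BUENOS AIRES", "BUENOS AIRES", "BUENOS AIRES", "CORDOBA", "CORDOBA"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the asker's ventas2 data frame
ventas2 <- data.frame(
  CustomerID = 1:5,
  Province   = c("buenos aires", "AMBA", "cordoba", "BAIRES", "BUENOS AIRES"),
  stringsAsFactors = FALSE
)

# match() gives, for each Province, the row of `fixed` with the same City
idx <- match(ventas2$Province, fixed$City)

# Keep the original value when a Province has no entry in the lookup table
ventas2$ProvinceCorrected <- ifelse(is.na(idx),
                                    ventas2$Province,
                                    fixed$CityCorrected[idx])

print(ventas2)
```

Writing the result to a new column rather than overwriting `Province` keeps the raw values available for checking the corrections.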

Cluster plot:

[image: cluster dendrogram]
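
Since the cut height is a judgment call, it may help to compare the cluster memberships at several candidate heights before settling on one. This is a rough sketch on a subset of the example names; the candidate heights are arbitrary values chosen for illustration, not recommendations.

```r
# Subset of the example city names
cities_raw <- c("AMBA", "BAIRES", "BENOS AIRES", "BUENOS AIRES",
                "CORDOBA", "CORDOBESA", "CORRIENTES")

# Same pre-processing as in the answer: uppercase, whitespace removed
cities <- gsub("\\s+", "", toupper(cities_raw))

# Edit distances with unit costs, labelled with the original names
d <- adist(cities)
rownames(d) <- cities_raw
hc <- hclust(as.dist(d), method = "single")

# cutree() accepts a vector of heights and returns one column per height
heights <- c(1.9, 2.9, 4.9, 6.9)   # arbitrary candidates
memberships <- cutree(hc, h = heights)
print(memberships)

# Number of clusters obtained at each height
n_clusters <- apply(memberships, 2, function(col) length(unique(col)))
print(n_clusters)
```

Reading the membership matrix row by row shows at which height a given spelling starts being merged with its neighbours, which is a quick way to spot a height that is too permissive for the real data.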

Answer 2

Here's my small replication of your aggregate result. You'll need to change all the calls to data frames to fit whatever the structure of your data is.

df
#output
#       word freq
#1         a    1
#2         b    2
#3         c    3

#find the max frequency
mostFrequent<-max(df[,2])  #doesn't handle ties

#find the word we will be replacing with
replaceString<-df[df[,2]==mostFrequent,1]
#[1] "c"

#find all the other words to be replaced
tobereplaced<-df[df[,2]!=mostFrequent,1]
#[1] "a" "b"

Now say you have the following data frame, which contains your entire dataset. I'll just replicate a single column with words.

totalData
 #    [,1]
 #[1,] "a" 
 #[2,] "c" 
 #[3,] "b" 
 #[4,] "d" 
 #[5,] "f" 
 #[6,] "a" 
 #[7,] "d" 
 #[8,] "b" 
 #[9,] "c"

We can replace all the words we want to replace with the replacement string using the following call:

totalData[totalData%in%tobereplaced]<-replaceString
 #    [,1]
 #[1,] "c" 
 #[2,] "c" 
 #[3,] "c" 
 #[4,] "d" 
 #[5,] "f" 
 #[6,] "c" 
 #[7,] "d" 
 #[8,] "c" 
 #[9,] "c"

As you can see, all a's and b's have been replaced with c, while the other words are unchanged.
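
The `max()` step above is marked as not handling ties: if two words share the highest frequency, the comparison against `mostFrequent` returns more than one replacement word. One possible workaround, shown here only as a sketch with a made-up tied frequency table, is `which.max()`, which returns the index of the first maximum.

```r
# Hypothetical frequency table with a tie between "b" and "c"
df <- data.frame(word = c("a", "b", "c"), freq = c(1, 3, 3),
                 stringsAsFactors = FALSE)

# which.max() returns the index of the first maximum, so a tie
# resolves to the earliest row and yields a single replacement word
replaceString <- df$word[which.max(df$freq)]
tobereplaced  <- setdiff(df$word, replaceString)

# Same single-column data as in the example above
totalData <- matrix(c("a", "c", "b", "d", "f", "a", "d", "b", "c"), ncol = 1)
totalData[totalData %in% tobereplaced] <- replaceString
print(totalData)
```

Taking the first maximum is an arbitrary tie-break; if row order carries no meaning in your data, you may prefer to sort the table first or resolve ties by hand.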

Answer 1 (score 3, accepted): 2014-09-09 21:56:40

Answer 2 (score 0): 2014-09-09 20:04:49