
Clustering similar words using R

I am working with R and have two dataframes. The first contains 200,000 words such as "cat", "cats", "cts", "dogs", and "dog"; the second contains words such as "cat" and "dog".

I want to cluster the first dataframe and replace every similar word with the matching word from the second dataframe. For example, "cats" and "cts" should become "cat".

As mentioned by @G5W, the task requires user input. Here is an example of how that could be done:

# we have here pre-defined choices: any match must come out of 'animals'
animals <- c('cat','dog','mouse')
# here is the text we want to match
text <- c('cats', 'cuts', 'dogs', 'dawg', 'frog', 'lion')
# now we use the string distance metric
# via the package stringdist & using metric 'jw'
# c.f. ?stringdist::stringdist
vapply(seq_along(text), 
       function (k) animals[which.min(stringdist::stringdist(text[k], animals, 'jw'))], 
       character(1))
# [1] "cat" "cat" "dog" "dog" "dog" "dog"

Notice that, for instance, "lion" is matched to "dog" because "dog" is the closest of the available choices, even though the two words are unrelated.
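One way to keep unrelated words such as "lion" from being forced onto a choice is to accept a match only below a distance cutoff. The following is a minimal sketch, not part of the original answer: it swaps the 'jw' metric for the Levenshtein edit distance from base R's utils::adist so that it runs without extra packages, and the helper name match_with_cutoff and the cutoff of 1 edit are assumptions chosen for illustration.

```r
# Pre-defined choices and the text to match, as in the example above
animals <- c("cat", "dog", "mouse")
text <- c("cats", "cuts", "dogs", "dawg", "frog", "lion")

# Hypothetical helper: return the closest choice for each word,
# or NA when even the closest choice is more than `max_dist` edits away
match_with_cutoff <- function(words, choices, max_dist = 1) {
  # rows = words, columns = choices, entries = Levenshtein distances
  d <- utils::adist(words, choices)
  # column index of the smallest distance per row (first one on ties)
  best <- max.col(-d, ties.method = "first")
  best_dist <- d[cbind(seq_along(words), best)]
  ifelse(best_dist <= max_dist, choices[best], NA_character_)
}

matched <- match_with_cutoff(text, animals)
print(matched)
```

With a cutoff of 1 edit only "cats" and "dogs" are matched; "lion" is left as NA, but so are "cuts", "dawg" and "frog". The cutoff is a trade-off that probably needs tuning against the real data.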

To further illustrate the points made in the comment section, consider the following:

stringdist::stringdist('cts', c('cats','cuts'), 'jw')
# [1] 0.08333333 0.08333333

The word "cts" is equidistant from both "cats" and "cuts". Assume both words are contained in the matching table animals; in that case which.min returns the index of the first minimum, so we would obtain "cats".

You can see how this can become a problem: if "cts" was supposed to be "cuts", the above would yield the wrong value.
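To make such ties visible instead of silently taking the first minimum, one could return every candidate at the minimal distance and leave the decision to the user. A minimal sketch, using base R's utils::adist (Levenshtein distance) rather than the 'jw' metric; closest_candidates is a hypothetical helper name.

```r
# Hypothetical helper: return all choices tied for the smallest distance
closest_candidates <- function(word, choices) {
  # whole-number edit distances from `word` to each choice
  d <- drop(utils::adist(word, choices))
  choices[d == min(d)]
}

# "cats" and "cuts" are each one insertion away from "cts", so both come back
ties <- closest_candidates("cts", c("cats", "cuts"))
print(ties)

# an exact match has distance 0 and is returned alone
unique_match <- closest_candidates("cats", c("cats", "cuts"))
print(unique_match)
```

A result of length greater than one signals an ambiguous word that may need manual review.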

Thanks for your question!

I'm currently away from home and typing the proposed solution on my iPhone, but I'll apply it to your example when I'm back.

One way to map similar values onto each other is the agrep function, which performs approximate string matching. You don't need any package for it; it is part of base R.

Please leave a comment if you need specific examples :)

Here is its signature:

agrep(pattern, x, max.distance = 0.1, costs = NULL, ignore.case = FALSE, 
      value = FALSE, fixed = TRUE, useBytes = FALSE)
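As a rough illustration of how agrep could be applied to the question, the sketch below loops over the reference words and overwrites every approximate match. The vectors words and targets and the choice max.distance = 1 (at most one edit) are assumptions made up for this example, not something given in the answer.

```r
# Illustrative data: `words` stands in for the large dataframe column,
# `targets` for the reference words
words <- c("cats", "cts", "dogs", "dog", "cat")
targets <- c("cat", "dog")

cleaned <- words
for (target in targets) {
  # indices of the elements that approximately contain `target`
  hits <- agrep(target, words, max.distance = 1)
  cleaned[hits] <- target
}
print(cleaned)
```

Note that agrep matches the pattern against substrings, so a longer word that merely contains something close to "cat" would also be replaced, and a word that matches several targets ends up with the last one in the loop.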


 