Merging data table factor levels in R

Question

Suppose I have a very large data table, one column of which is "ManufacturerName". The data was not entered uniformly, so it's pretty messy. For example, there may be observations like:

ABC Inc
ABC, Inc
ABC Incorporated
A.B.C.
...
Joe Shmos Plumbing
Joe Shmo Plumbing
...

I am looking for an automated way in R to treat similar names as one factor level. I have learned the syntax to do this manually, for example:

levels(df$ManufacturerName) <- list(ABC=c("ABC", "A.B.C", ....), JoeShmoPlumbing=c(...))

But I'm trying to think of an automated solution. Obviously it's not going to be perfect, as I can't anticipate every kind of variation in the data table. But maybe something that searches the factor levels, strips out punctuation/special characters, and creates levels based on common first words. Or any other ideas. Thanks!

Answer 1

Look into the stringdist package. For starters, you could do something like this:

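library(stringdist)

# a few of the messy names from the question
x <- c("ABC Inc", "ABC, Inc", "ABC Incorporated", "A.B.C.", "Joe Shmos Plumbing", "Joe Shmo Plumbing")
# pairwise string distances (default method is "osa", optimal string alignment)
d <- stringdistmatrix(x)
d
#    1  2  3  4  5
# 2  1
# 3  9 10
# 4  6  7 15
# 5 16 16 16 18
# 6 15 15 15 17  1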

For more help, see ?stringdistmatrix, or search StackOverflow for fuzzy matching, approximate string matching, string distance functions, and agrep.
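
One possible next step, as a rough sketch rather than a definitive recipe: clean the names, cluster them on string distance with hclust, and map each cluster to a single level. The cleaning rules, the "lv" (Levenshtein) method, average linkage, the cutoff h = 5, and the choice of the first spelling as the merged label are all assumptions you would need to tune on your own data.

```r
library(stringdist)

x <- c("ABC Inc", "ABC, Inc", "ABC Incorporated", "A.B.C.",
       "Joe Shmos Plumbing", "Joe Shmo Plumbing")

# Normalize before comparing: lower-case, drop punctuation, squeeze spaces.
clean <- tolower(x)
clean <- gsub("[[:punct:]]", "", clean)
clean <- trimws(gsub("[[:space:]]+", " ", clean))

# Pairwise Levenshtein distances; with a single argument,
# stringdistmatrix() returns a "dist" object that hclust() accepts.
d  <- stringdistmatrix(clean, method = "lv")
hc <- hclust(d, method = "average")

# Cut the tree at an assumed height; names closer than this end up together.
groups <- cutree(hc, h = 5)

# Use the first spelling seen in each cluster as the merged level
# (an arbitrary choice made for illustration).
canonical <- vapply(split(x, groups), function(v) v[1], character(1))
merged    <- factor(unname(canonical[as.character(groups)]))

data.frame(original = x, merged = merged)
```

plot(hc) draws the dendrogram, which helps when picking the cutoff: a higher h merges more aggressively, a lower one leaves more spellings separate. On the real table you would probably run this on levels(df$ManufacturerName) rather than on every row, since the distance matrix grows quadratically with the number of distinct names.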

Answer 1
0 2015-10-06 19:57:58
