Suppose I have a very large data table, one column of which is "ManufacturerName". The data was not entered uniformly, so it's pretty messy. For example, there may be observations like:
ABC Inc
ABC, Inc
ABC Incorporated
A.B.C.
...
Joe Shmos Plumbing
Joe Shmo Plumbing
...
I am looking for an automated way in R to treat similar names as one factor level. I have learned the syntax to do this manually, for example:
levels(df$ManufacturerName) <- list(ABC=c("ABC", "A.B.C", ....), JoeShmoPlumbing=c(...))
But I'm trying to think of an automated solution. Obviously it's not going to be perfect as I can't anticipate every type of permutation in the data table. But maybe something that searches the factor levels, strips out punctuation/special characters, and creates levels based on common first words. Or any other ideas. Thanks!
Look into the stringdist package. For starters, you could do something like this:
library(stringdist)
x <- c("ABC Inc", "ABC, Inc", "ABC Incorporated", "A.B.C.", "Joe Shmos Plumbing", "Joe Shmo Plumbing")
d <- stringdistmatrix(x)  # pairwise string distances, returned as a "dist" object
d
#    1  2  3  4  5
# 2  1
# 3  9 10
# 4  6  7 15
# 5 16 16 16 18
# 6 15 15 15 17  1
For more help, see ?stringdistmatrix or do searches on StackOverflow for fuzzy matching, approximate string matching, string distance functions, and agrep.
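One possible way to turn pairwise distances into merged factor levels is sketched below. It is only a rough illustration, not a definitive method: it first normalises the names (lower case, punctuation stripped, a trailing legal suffix dropped), then clusters on edit distance and labels each cluster with the first original spelling found in it. To stay self-contained it uses base R's adist (generalised Levenshtein distance) instead of stringdist. The helper name normalise, the suffix list, and the cut-off of 2 edits are all assumptions made for this example and would need tuning on real data.

```r
# Sample names taken from the question.
x <- c("ABC Inc", "ABC, Inc", "ABC Incorporated", "A.B.C.",
       "Joe Shmos Plumbing", "Joe Shmo Plumbing")

# Hypothetical helper: lower-case, strip punctuation, drop a trailing
# legal suffix. The suffix list is an assumption; extend it as needed.
normalise <- function(s) {
  s <- tolower(s)
  s <- gsub("[[:punct:]]", "", s)
  s <- gsub("[[:space:]]+(inc|incorporated)$", "", s)
  trimws(s)
}
key <- normalise(x)

# Pairwise edit distances between the normalised names (base R, utils).
d <- adist(key)

# Hierarchical clustering; cut the tree so that names within 2 edits of
# each other end up together. The threshold is an arbitrary assumption.
hc <- hclust(as.dist(d), method = "complete")
groups <- cutree(hc, h = 2)

# Label each cluster with the first original spelling seen in it.
labels <- vapply(split(x, groups), function(v) v[1], character(1))
cleaned <- factor(unname(labels[as.character(groups)]))
```

On a real table the same steps would be run on levels(df$ManufacturerName) rather than on every row, and the resulting groups should be checked by eye before overwriting anything, since a fixed threshold will both over-merge and under-merge on messy data.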