
Partial string matching & replacement in R

I have a dataframe like this

> myDataFrame
           company
1   Investment LLC
2    Hyperloop LLC
3 Invezzstment LLC
4   Investment_LLC
5   Haiperloop LLC
6   Inwestment LLC

I need to match all these fuzzy strings, so the end result should look like this:

> myDataFrame
           company
1   Investment LLC
2    Hyperloop LLC
3   Investment LLC
4   Investment LLC
5    Hyperloop LLC
6   Investment LLC

So I actually need to solve a partial match-and-replace task for a categorical variable. There are a lot of great functions in base R and in packages for string matching, but I'm stuck trying to find a single solution for this kind of match-and-replace. I don't care which spelling replaces the others; for example, "Investment LLC" and "Invezzstment LLC" are both equally fine. I just need them to be consistent.

Is there any single all-in-one function or a loop for this?

If you have a vector of correct spellings, agrep makes this reasonably easy:

myDataFrame$company <- sapply(myDataFrame$company, 
                              function(val){agrep(val, 
                                                  c('Investment LLC', 'Hyperloop LLC'), 
                                                  value = TRUE)})

myDataFrame
#          company
# 1 Investment LLC
# 2  Hyperloop LLC
# 3 Investment LLC
# 4 Investment LLC
# 5  Hyperloop LLC
# 6 Investment LLC
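
One caveat with the sapply/agrep approach is that agrep can return zero or several matches for a value, in which case sapply gives back a list instead of a character vector. A possible variation, sketched here only as an illustration, is to compute edit distances with adist and always pick the single closest correct spelling. The names correct, max_edits and cleaned, and the cutoff of 3 edits, are assumptions made for this example.

```r
# Canonical spellings (taken from the answer above).
correct <- c("Investment LLC", "Hyperloop LLC")

company <- c("Investment LLC", "Hyperloop LLC", "Invezzstment LLC",
             "Investment_LLC", "Haiperloop LLC", "Inwestment LLC")

# adist() returns a matrix of generalized Levenshtein distances:
# one row per observed value, one column per canonical spelling.
d <- adist(company, correct, ignore.case = TRUE)

# For each observed value, take the canonical spelling with the smallest distance.
nearest <- correct[apply(d, 1, which.min)]

# Hypothetical cutoff: keep the original value if even its best match
# is more than max_edits edits away, so unrelated names are not overwritten.
max_edits <- 3
cleaned <- ifelse(apply(d, 1, min) <= max_edits, nearest, company)

print(data.frame(company = company, cleaned = cleaned))
```

The cutoff still has to be tuned for real data, just like max.distance in agrep.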

If you don't have such a vector, you can likely build one with a clever application of adist, or even just table if the correct spelling occurs more often than the misspellings, which it likely will (though it doesn't here).
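
As a rough sketch of what building that vector with adist and table could look like: cluster the distinct values by edit distance, then use the most frequent member of each cluster as its representative. This is one possible reading of the suggestion, not a definitive method. The cut height of 4 edits and the names vals, groups, rep_for_group and cleaned are assumptions chosen for the sample data.

```r
company <- c("Investment LLC", "Hyperloop LLC", "Invezzstment LLC",
             "Investment_LLC", "Haiperloop LLC", "Inwestment LLC")

# Pairwise edit distances between all distinct observed values.
vals <- unique(company)
d <- adist(vals, ignore.case = TRUE)
rownames(d) <- vals

# Complete-linkage clustering; cutting at height 4 (an assumed threshold)
# keeps values in one group only if every pair is at most 4 edits apart.
groups <- cutree(hclust(as.dist(d), method = "complete"), h = 4)

# How often each distinct value occurs in the data, in the order of vals.
counts <- as.vector(table(company)[vals])

# Representative of each group: its most frequent member
# (ties go to the member that appears first in the data).
rep_for_group <- sapply(split(seq_along(vals), groups),
                        function(i) vals[i][which.max(counts[i])])

# Map every original value to the representative of its group.
cleaned <- unname(rep_for_group[as.character(groups[match(company, vals)])])

print(data.frame(company = company, cleaned = cleaned))
```

The resulting groups should still be checked by eye, because a threshold that works for one pair of company names can merge two genuinely different names elsewhere.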

So, after some time I ended up with this crude code. Caution: it does not fully automate the replacement, because each time a human has to verify the matches, and each time the max.distance argument of agrep needs fine-tuning. I'm sure there are ways to make it better and quicker, but this can help get the job done.

    ##########
    # Manual renaming with partial matches
    ##########

    # a) Take a look at the desired column of factor variables
    sort(unique(MYDATA$names))   # take a look

    # ****
    Sensthreshold <- 0.2   # sensitivity of agrep, usually 0.1-0.2 get it right
    Searchstring <- "Invesstment LLC"   # what should I search?
    # ****

    # User-defined function: returns similar string on query in column
    Searcher <- function(input, similarity = 0.1) {
      unique(agrep(input, 
                   MYDATA$names,   # <-- define your column here
                   ignore.case = TRUE, value = TRUE,
                   max.distance = similarity))
    }

    # b) Make a search of desired string
    Searcher(Searchstring, Sensthreshold)   # using user-def function 
    ### PLEASE INSPECT THE OUTPUT OF THE SEARCH
    ### Did it get it right?

    # =============================================================================
    ## ACTION! This changes your data frame!
    ## Please make a backup before proceeding
    ## Please execute this code as a whole to avoid errors

    # c) Get the indices of the matching cells (after checking the output above)
    vector_of_cells <- agrep(Searchstring, 
                       MYDATA$names, ignore.case = TRUE,
                       max.distance = Sensthreshold)
    # d) Apply the changes
    # If the column is a factor, convert it to character first: assigning a
    # string that is not an existing level would otherwise produce NA
    MYDATA$names <- as.character(MYDATA$names)
    MYDATA$names[vector_of_cells] <- Searchstring # <--- CHANGING STRING
    # e) Check result
    unique(agrep(Searchstring, MYDATA$names, 
                 ignore.case = TRUE, value = TRUE, max.distance = Sensthreshold))
    # =============================================================================
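
The search-and-replace steps above could also be wrapped in a small function that returns a cleaned copy instead of changing the data frame in place, which makes it easier to compare before and after. This is only a sketch: replace_fuzzy, names_raw and names_clean are made-up names, and the 0.2 threshold still needs to be checked by eye for each search string.

```r
# Hypothetical helper: replaces every element of x that agrep() considers
# an approximate match for target, and returns the modified copy.
replace_fuzzy <- function(x, target, max_distance = 0.2) {
  x <- as.character(x)  # avoids NA when x is a factor without that level
  hits <- agrep(target, x, ignore.case = TRUE, max.distance = max_distance)
  x[hits] <- target
  x
}

names_raw <- c("Investment LLC", "Hyperloop LLC", "Invezzstment LLC",
               "Investment_LLC", "Haiperloop LLC", "Inwestment LLC")

# Apply one target spelling at a time, inspecting the result after each step.
names_clean <- replace_fuzzy(names_raw, "Investment LLC", 0.2)
names_clean <- replace_fuzzy(names_clean, "Hyperloop LLC", 0.2)

print(data.frame(before = names_raw, after = names_clean))
```

Because the original vector is left untouched, a wrong threshold can be corrected by simply rerunning the call with a different max_distance.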


 