简体   繁体   中英

How to map the most “similar” strings from one list to another in python?

Given are two lists containing strings.

  1. One contains the name of organisations (mostly universitys) all around the world - not only written in english but always using latin alphabet.

  2. The other list contains mostly full addresses in which strings (organisations) from the first list may occur.

An Example:

addresses = [
             "Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium",
             "Machine Learning and Computational Biology Research Group, Max Planck Institutes     Tübingen, Tübingen, Germany 72076",
             "Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185",
             "Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754",    
             "Computer Science Department, University of California, Santa Barbara, USA 93106",
             "Fraunhofer IAIS, Sankt Augustin, Germany",
             "Department of Computer Science, Cornell University, Ithaca, NY",
             "University of Wisconsin-Madison"
            ]

organisations = [
                 "Catholic University of Leuven"
                 "Fraunhofer IAIS"
                 "Cornell University of Ithaca"
                 "Tübingener Max Plank Institut"
                ]

As you can see the desired mapping would be:

"Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium",
--> Catholic University of  Leuven
"Machine Learning and Computational Biology Research Group, Max Planck Institutes     Tübingen, Tübingen, Germany 72076",
--> Max Plank Institut Tübingen
"Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185",
--> --
"Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754",
--> Fraunhofer IAIS 
"Computer Science Department, University of California, Santa Barbara, USA 93106",
"Fraunhofer IAIS, Sankt Augustin, Germany",
--> Fraunhofer IAIS
"Department of Computer Science, Cornell University, Ithaca, NY"
--> "Cornell University of Ithaca",
"University of Wisconsin-Madison",
--> --

My thinking was to use some kind of "disctance- algorithm" to calculate the similarity of the strings. Since I cannot just look for an organisation in an address just by doing if address in organisation because it could be written slightly differently at in different places. So my first guess was using the difflib module. Especially the difflib.get_close_matches() function for selecting for every address the closest string from the organisations list. But I am not quite confident, that the results will be accurate enough. Although I don't know how high I should set the ratio which seams to be a similarity measure.

Before spending too much time in trying the difflib module I thought of asking the more experienced people here, if this is the right approach or if there is a more suited tool / way to solve my problem. Thanks!

PS: I don't need an optimal solution.

Use the following as your string distance function (instead of plain levenshtein distance):

def strdist(s1, s2):
    words1 = set(w for w in s1.split() if len(w) > 3)
    words2 = set(w for w in s2.split() if len(w) > 3)

    scores = [min(levenshtein(w1, w2) for w2 in words2) for w1 in words1]
    n_shared_words = len([s for s in scores if s <= 3])
    return -n_shared_words 

Then use the Munkres assignment algorithm shown here since there appears to be a 1:1 mapping between organisations and adresses.

You can use soundex or metaphone to translate the sentence into a list of phonems, then compare the most similar lists.

Here is a Python implementation of the double-metaphone algo .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM