String similarity between list of words

I have a series named transition that holds string sequences, with the elements of each sequence separated by '>'. The last element of each sequence is always the same, e.g.:

0                    b>c>d>a
1                    d>c>c>a
2                    e>e>c>a
3                    d>b>c>a
4                    d>c>c>a

I want to calculate the similarity between each sequence and every other sequence, express that similarity as a percentage, and find the most frequent sequences in the dataset. I know this is a general question, but what is the best approach?

This is what I have tried so far, but it just returns a matrix of match counts, not the similarity level or the most frequent sequences:

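import numpy as np
seqs = [s.split('>') for s in transition]  # assumes all sequences have the same length
n = len(seqs)
sim = np.zeros((n, n), dtype=int)
for i, p1 in enumerate(seqs):
    for j, p2 in enumerate(seqs[i:], start=i):
        # count the positions at which the two sequences hold the same element
        sim[i, j] = sim[j, i] = np.sum(np.array(p1) == np.array(p2))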

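One possible solution is to use the Levenshtein distance, which is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other.

In Python, using the python-Levenshtein package, your code would look something like this: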

pip install python-Levenshtein

import Levenshtein
dist = Levenshtein.distance('Levenshtein', 'Lenvinsten')
print(dist)

You will then have to build a pairwise table (a pivot table or matrix) that holds the distance between every pair of your strings in one place.
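
The pairwise table can be sketched as follows. This is only an illustration under some assumptions, not the answer's own code: it uses a small pure-Python edit distance over the '>'-separated elements rather than the python-Levenshtein package (so 'b>c' is compared element by element, not character by character), it turns the distance into a percentage by dividing by the length of the longer sequence, and it counts exact duplicates with collections.Counter to find the most frequent sequence. The names edit_distance, similarity_pct, matrix and most_common are made up for this example.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two lists of elements."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity_pct(a, b):
    """Similarity in percent: 100 for identical, 0 for nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(a, b) / longest)


# Sample data copied from the question.
transition = ["b>c>d>a", "d>c>c>a", "e>e>c>a", "d>b>c>a", "d>c>c>a"]
tokens = [s.split(">") for s in transition]

# Symmetric table: matrix[i][j] is the similarity of sequence i and sequence j.
matrix = [[similarity_pct(p, q) for q in tokens] for p in tokens]

# Most frequent exact sequence and how often it occurs.
most_common = Counter(transition).most_common(1)

for row in matrix:
    print(" ".join(f"{v:6.1f}" for v in row))
print(most_common)
```

If the data lives in a pandas Series, the same matrix can be wrapped in a DataFrame indexed by the sequences, and value_counts() on the Series gives the same frequency information as Counter.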
