简体   繁体   中英

Compute Levenshtein Distance between two String Columns from two different dataframes

I have two dataframes which contain the same string column (hostname) and I want to compute the Levenshtein distance between all possible combinations of hostnames pairs between both dataframes and put the results in a third dataframe where I keep the distance for each possible combination but also both indexes of this combination.

For example, let's say I have these two dataframes:

Index      Hostname
85608             dlt-rly-tracker-3.datto.com
9378      lnv7bc4241e2.1528.ozvision.ozsn.net
22791             dlt-rly-tracker-1.datto.com
88922                                 pw-file
94560     lnv7bc4241e2.1528.ozvision.ozsn.net
13245                                       -
63604                                 pw-file
435839                                pw-file
95473                                       -
13856                                 pw-file
210705                                pw-file
30046                                       -
106917            dlt-rly-tracker-2.datto.com
415925                                pw-file
170471                                pw-file
73971                                       -
86885             dlt-rly-tracker-3.datto.com
162764                                pw-file
74791                                 pw-file

and the second dataframe:

Index     Hostname
93358                  device.dattobackup.com
34067             dlt-rly-tracker-5.datto.com
18083               46.104.89.54.in-addr.arpa
96798                                 pw-file
130940                                pw-file
31476     lnv7bc4241e2.1528.ozvision.ozsn.net
149723                                pw-file
52901                                       -
308834    lnv7bc4241e2.1528.ozvision.ozsn.net
24196                                 pw-file
69038                                       -
244454    lnv7bc4241e2.1528.ozvision.ozsn.net
2867                                        -
45549                        daisy.ubuntu.com
334378                                pw-file
86006               46.104.89.54.in-addr.arpa
430257                                pw-file
86150               46.104.89.54.in-addr.arpa
65189                                 pw-file

What I want to do is to take the first value of hostname (dlt-rly-tracker-3.datto.com) and compute the levenshtein distance with the all values of hostnames from the second dataframe (one by one). Store the results at the end of this process in a new dataframe that will look similar to the following:

Indexes         Distance    Hostnames
85608-93358     23          dlt-rly-tracker-3.datto.com,device.dattobackup.com
85608-34067     60          dlt-rly-tracker-3.datto.com,dlt-rly-tracker-5.datto.com

I really appreciate any help to solve my problem. Thanks.

The solution below will loop through both data frames and create a new dictionary with the data you want. You should then convert this dictionary to a data frame. Let me know if this helps!

 dist = {}
 for rowname, row in df.iterrows(): 
      for rowname1, row1 in df1.iterrows(): 
            L = Levenstein(row.Hostname, row1.Hostname)
            dist.update( {rowname+’-‘+rowname1 : (L, row.Hostname+’,’+row1.Hostname} )

Here is my proposition to solve it.

import pandas as pd
from nltk import edit_distance

Here you need to create your two DataFrames. I assume that they are called:

df1

df2

outputList = []
for rowLeft  in df1.iterrows():
    for rowRight in df2.iterrows():
        indexes = str(rowRight[1][0]) + "-" + str(rowLeft[1][0])
        distance = edit_distance(rowRight[1][1],rowLeft[1][1])
        hostNames = rowRight[1][1] + "-" + rowLeft[1][1]
        outputList.append({"Indexes": indexes, "Distance":distance, "Hostnames":hostNames})

outputDf = pd.DataFrame(outputList)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM