Check the Levenshtein distance from each row of a pandas DataFrame to the rows of another?

Question

I have two data frames:

import pandas as pd

df1 = pd.DataFrame({'text': ['hello world', 'world hello'], 'id': [11,31]})
df2 = pd.DataFrame({'test': ['hello', 'world'], 'id': [13,11]})

I want to compute the Levenshtein similarity ratio between each text in df1 and the texts in df2, and if the score is >= 0.9, drop that record from df1.

What I tried:

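import Levenshtein
from tqdm import tqdm

def check_levenshtein_distance(df1, df2):
    # Collect the index labels of df1 rows that match any df2 row
    matches = []
    with tqdm(total=df1.shape[0]) as pbar:
        for index, row in df1.iterrows():
            for index1, row1 in df2.iterrows():
                # ratio() is a similarity in [0, 1], not a distance
                score = Levenshtein.ratio(str(row['text']), str(row1['test']))
                if score >= 0.9:
                    matches.append(index)
                    break  # one match is enough; avoids duplicate labels
            pbar.update(1)
    return matches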

data_d = check_levenshtein_distance(df1, df2)

Then:

df1 = df1.drop(index=data_d)

Is there a better, faster way to do this in pure pandas?

Answer 1

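Since you pointed out that the previous solution (kept below) caused memory problems, which is not surprising because it generates every possible combination, I have another suggestion. It will be a bit slower, but it does not create all the combinations, so it needs less memory. I do want to urge you to reconsider whether a DataFrame is the best way to go. When handling large amounts of text, data frames are often not the best solution...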

import pandas
import Levenshtein

df1 = pandas.DataFrame({"text": ["hello world", "world hello"], "id": [11, 31]})
df2 = pandas.DataFrame({"test": ["hello", "world", "hello word"], "id": [13, 11, 12]})

# Make sure the types of the columns are correct
df1["text"] = df1["text"].astype(str)
df2["test"] = df2["test"].astype(str)


def filter_rows(row: pandas.Series) -> pandas.Series:

    # By default, the row doesn't need to be removed
    row["remove"] = False

    # Loop over the texts in the other dataframe
    for text in df2["test"].values:

        # Check the distance
        if Levenshtein.ratio(row["text"], text) >= 0.9:

            # Indicate that this row needs to be removed
            row["remove"] = True

            # Return the row, so don't look any further!
            return row

    # If we didn't return yet, just return the default
    return row


# Apply the function (this will create a new column called "remove", indicating if a row should be removed)
df1 = df1.apply(filter_rows, axis=1)

# Remove the rows that have the remove indication, and drop the column
df1 = df1.loc[~df1["remove"]].drop(columns=["remove"])
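
One way to win back some of that speed, sketched under an assumption: if `Levenshtein.ratio(a, b)` is the normalized indel similarity, 2 * LCS(a, b) / (len(a) + len(b)), then it can never exceed 2 * min(len(a), len(b)) / (len(a) + len(b)). Pairs whose lengths differ too much can then be skipped before the expensive call. The helper name `can_reach` is made up for illustration, and the assumption about the formula should be checked against the installed version of the library.

```python
def can_reach(a: str, b: str, threshold: float = 0.9) -> bool:
    """Cheap length-only test: could ratio(a, b) possibly reach threshold?

    Assumes ratio(a, b) == 2 * LCS(a, b) / (len(a) + len(b)). The LCS is
    at most the length of the shorter string, which gives an upper bound.
    """
    total = len(a) + len(b)
    if total == 0:
        return True  # two empty strings are identical
    return 2 * min(len(a), len(b)) / total >= threshold


# "hello" (5 chars) vs "hello world" (11 chars): the bound is 10/16 = 0.625,
# so this pair cannot reach 0.9 and the real comparison can be skipped
print(can_reach("hello world", "hello"))

# "hello word" (10 chars) vs "hello world" (11 chars): the bound is 20/21,
# so this pair still has to be compared properly
print(can_reach("hello world", "hello word"))
```

Inside `filter_rows`, the condition could then read `can_reach(row["text"], text) and Levenshtein.ratio(row["text"], text) >= 0.9`, so the ratio is only computed for pairs that pass the length test.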

Previous answer:

Try it this way:

import pandas
import Levenshtein

df1 = pandas.DataFrame({"text": ["hello world", "world hello"], "id": [11, 31]})
df2 = pandas.DataFrame({"test": ["hello", "world", "hello word"], "id": [13, 11, 12]})

# Create all possible combinations by joining the dataframes on a fictional key
df1["key"] = 0
df2["key"] = 0
df = df1.merge(df2, on="key").drop(columns=["key"])

# Calculate the distances for all possible combinations
df["distance"] = df.apply(lambda row: Levenshtein.ratio(str(row["text"]), str(row["test"])), axis=1)

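# Keep only the rows of df1 that have no match with a ratio of at least 0.9
matched_ids = df.loc[df["distance"] >= 0.9, "id_x"]
df1 = df1.loc[~df1["id"].isin(matched_ids)].drop(columns=["key"])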

Answer 2

Do I understand correctly that you want to drop every row i of df1 for which at least one element of df2['test'] has a Levenshtein ratio >= 0.9 with df1.loc[i, 'text']?

If so, you can try:

df1 = df1[df1['text'].map(lambda s: not any(Levenshtein.ratio(s, t) >= 0.9 for t in df2['test']))]
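
To see what the 0.9 threshold in that one-liner measures, here is a dependency-free sketch of the same filter on plain lists. It assumes `Levenshtein.ratio` is the normalized indel similarity (insertions and deletions only, so a substitution counts as 2 edits), which is my understanding of the library but should be treated as an assumption. `lcs_length`, `indel_ratio` and `keep_row` are illustrative names, not library functions.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Assumed stand-in for Levenshtein.ratio: 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return 2 * lcs_length(a, b) / total


def keep_row(text: str, candidates: list[str], threshold: float = 0.9) -> bool:
    """True if no candidate is similar enough; any() stops at the first hit."""
    return not any(indel_ratio(text, c) >= threshold for c in candidates)


texts = ["hello world", "world hello"]
candidates = ["hello", "world", "hello word"]

kept = [t for t in texts if keep_row(t, candidates)]
print(kept)  # ['world hello']
```

"hello world" is dropped because "hello word" is one deletion away, giving 20/21 (about 0.95), while "world hello" has no candidate above the threshold.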


Answer 1 · 1 · 2020-12-01 10:42:52
Answer 2 · 0 · 2020-12-01 11:14:03
