Check the Levenshtein distance of each row against the rows of another pandas DataFrame?
I have two DataFrames:
df1 = pd.DataFrame({'text': ['hello world', 'world hello'], 'id': [11,31]})
df2 = pd.DataFrame({'test': ['hello', 'world'], 'id': [13,11]})
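I want to compute the Levenshtein similarity score of each text row in df1 against the rows of df2 and, if the score is >= 0.9, remove that record from df1.
What I have tried: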
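import Levenshtein
from tqdm import tqdm

def check_levenshtein_distance(df1, df2):
    # Index labels of the df1 rows that match at least one df2 row
    matched = []
    with tqdm(total=df1.shape[0]) as pbar:
        for index, row in df1.iterrows():
            for _, row1 in df2.iterrows():
                # df2 stores its strings in the 'test' column, not 'text'
                score = Levenshtein.ratio(str(row['text']), str(row1['test']))
                if score >= 0.9:
                    matched.append(index)
                    break  # one match is enough; avoids duplicate labels
            pbar.update(1)
    return matched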
data_d = check_levenshtein_distance(df1, df2)
Then:
df1 = df1.drop(index=data_d)
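Is there a better and faster way to do this in pure pandas?
Since you have pointed out that the previous solution caused memory problems (which is not surprising, as it generates all possible combinations), I have another suggestion. It will be a bit slower, but it does not create all possible combinations, so it needs less memory. I do want to urge you to reconsider whether DataFrames are the best way to go. When working with large amounts of text, DataFrames are often not the best solution...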
import pandas
import Levenshtein
df1 = pandas.DataFrame({"text": ["hello world", "world hello"], "id": [11, 31]})
df2 = pandas.DataFrame({"test": ["hello", "world", "hello word"], "id": [13, 11, 12]})
# Make sure the types of the columns are correct
df1["text"] = df1["text"].astype(str)
df2["test"] = df2["test"].astype(str)
def filter_rows(row: pandas.Series) -> pandas.Series:
    # By default, the row doesn't need to be removed
    row["remove"] = False
    # Loop over the texts in the other dataframe
    for text in df2["test"].values:
        # Check the similarity score
        if Levenshtein.ratio(row["text"], text) >= 0.9:
            # Indicate that this row needs to be removed
            row["remove"] = True
            # Return the row right away; no need to look any further
            return row
    # If we didn't return yet, just return the default
    return row
# Apply the function (this will create a new column called "remove", indicating if a row should be removed)
df1 = df1.apply(filter_rows, axis=1)
# Remove the rows that have the remove indication, and drop the column
df1 = df1.loc[~df1["remove"]].drop(columns=["remove"])
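The remark above about DataFrames not always being the best tool can be sketched with plain Python lists. The following is only an illustration under stated assumptions: `indel_ratio` is a hypothetical pure-Python stand-in that assumes `Levenshtein.ratio` behaves like a normalized insert/delete similarity, i.e. 2 * LCS / (len(a) + len(b)); in real code you would call `Levenshtein.ratio` instead. Under that assumption the score can never exceed 2 * min(len) / (len(a) + len(b)), so pairs with very different lengths can be skipped without running the expensive comparison. The helper names `lcs_length` and `has_close_match` are also made up for this sketch.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Hypothetical stand-in for Levenshtein.ratio (assumed: 2 * LCS / total length)."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return 2 * lcs_length(a, b) / total


def has_close_match(text: str, candidates, threshold: float = 0.9) -> bool:
    """Return True as soon as one candidate scores at or above the threshold."""
    for cand in candidates:
        total = len(text) + len(cand)
        # Cheap upper bound on the score: skip pairs that cannot reach the threshold
        if total and 2 * min(len(text), len(cand)) / total < threshold:
            continue
        if indel_ratio(text, cand) >= threshold:
            return True
    return False


texts = ["hello world", "world hello"]
candidates = ["hello", "world", "hello word"]

# Keep only the texts that have no close match among the candidates
kept = [t for t in texts if not has_close_match(t, candidates)]
print(kept)
```

With the sample data, only the comparison against "hello word" survives the length check for each text; the two short candidates are skipped outright.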
Previous answer:
Try it this way:
import pandas
import Levenshtein
df1 = pandas.DataFrame({"text": ["hello world", "world hello"], "id": [11, 31]})
df2 = pandas.DataFrame({"test": ["hello", "world", "hello word"], "id": [13, 11, 12]})
# Create all possible combinations by joining the dataframes on a fictional key
df1["key"] = 0
df2["key"] = 0
df = df1.merge(df2, on="key").drop(columns=["key"])
# Calculate the distances for all possible combinations
df["distance"] = df.apply(lambda row: Levenshtein.ratio(str(row["text"]), str(row["test"])), axis=1)
# Keep only the rows of df1 that have no pair with a score >= 0.9
matched_ids = df.loc[df["distance"] >= 0.9, "id_x"]
df1 = df1.loc[~df1["id"].isin(matched_ids)].drop(columns=["key"])
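The cross join above materializes len(df1) * len(df2) rows before any filtering, which is where the memory goes. As a rough sketch of the alternative, the same pairs can be streamed lazily so that only the matching ids are stored. This uses the standard library only; `stand_in_ratio` (built on `difflib`) and `ids_to_remove` are hypothetical names for this example, and the `difflib` score is not guaranteed to equal `Levenshtein.ratio`, so pass the real scorer in practice.

```python
from difflib import SequenceMatcher
from itertools import product


def stand_in_ratio(a: str, b: str) -> float:
    # Assumption: difflib stand-in so the sketch runs without third-party packages
    return SequenceMatcher(None, a, b).ratio()


def ids_to_remove(left, right_texts, scorer=stand_in_ratio, threshold=0.9):
    """left: sequence of (id, text) pairs. Returns the ids with at least one match."""
    matched = set()
    # product() yields pairs one at a time; the full cross product is never stored
    for (row_id, text), other in product(left, right_texts):
        if row_id in matched:
            continue  # this id is already known to match; skip the comparison
        if scorer(text, other) >= threshold:
            matched.add(row_id)
    return matched


left = [(11, "hello world"), (31, "world hello")]
right_texts = ["hello", "world", "hello word"]
to_remove = ids_to_remove(left, right_texts)
print(to_remove)
```

The returned set could then be used with `df1["id"].isin(...)` to filter the DataFrame, at the cost of running the comparisons in a Python loop.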
Do I understand correctly that you want to remove any row i of df1 for which at least one element of df2['test'] has a Levenshtein score >= 0.9 against df1.loc[i, 'text']?
If so, you could try:
df1 = df1[df1['text'].map(lambda s: not any(Levenshtein.ratio(s, t) >= 0.9 for t in df2['test']))]