
Python: Removing similar strings in column

I have a DataFrame in which one column consists of strings:

import pandas as pd

d = pd.DataFrame({'text': ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
                           "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
                           "where are you going",
                           "i'm going to the zoo to pet the animals",
                           "where are you going jane"]})

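Problem: some of these strings can be very similar, differing by only one or two words, for example. I want to remove all "duplicates", i.e. all texts that are similar to another one. In the example above, rows 1 and 2 are similar, so I want to keep only the first. Likewise, rows 3 and 5 are similar, and I want to keep only row 3. The actual DataFrame has about 100,000 rows.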

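My attempt: I figured a good starting point would be to convert the strings into sets of words, so that they can be compared simply and efficiently: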

d["text"].str.split().apply(set)

Next, I wrote a function that compares each row with all later rows and flags it if at least 90% of its words also appear in another row. This is how I did it:

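def find_duplicates(df):
    # df is the 'text' Series; turn every text into a set of words
    df = df.str.split().apply(set)
    ls_duplicates = []
    for i in range(len(df)):
        doc_i = df.iloc[i]
        # compare row i only with the rows after it
        for j in range(i+1, len(df)):
            doc_j = df.iloc[j]
            # share of row i's words that also occur in row j
            score = len(doc_i.intersection(doc_j)) / len(doc_i)
            if score > 0.9:
                # records i (the earlier row) once per matching row j
                ls_duplicates.append(i)
    return ls_duplicates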

find_duplicates(d['text'])
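
To make the score used in the function above concrete, here is a small worked example on rows 3 and 5 of the sample data. It is only an illustration; the names forward, backward and jaccard are introduced here and are not part of the original code. It shows that the score depends on which row is taken as doc_i, whereas a Jaccard score (intersection over union) is symmetric.

```python
# Word sets for rows 3 and 5 of the sample data
doc_i = set("where are you going".split())
doc_j = set("where are you going jane".split())

shared = doc_i & doc_j

# Score as computed in find_duplicates: shared words / words in doc_i
forward = len(shared) / len(doc_i)            # 4 / 4

# Same pair with the roles swapped: shared words / words in doc_j
backward = len(shared) / len(doc_j)           # 4 / 5

# Symmetric alternative (Jaccard): shared words / all distinct words
jaccard = len(shared) / len(doc_i | doc_j)    # 4 / 5

print(forward, backward, jaccard)  # 1.0 0.8 0.8
```

With the 0.9 threshold, this pair counts as similar only when the shorter text plays the role of doc_i.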

This works for my purposes, but it runs very slowly. Is there a way to optimize it?

Comparing texts is a broad topic, but to pick the best matches from a list of strings, you can try:

import difflib

phrases =  ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
      "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
      "where are you going",
      "i'm going to the zoo to pet the animals",
      "where are you going jane"]

difflib.get_close_matches('where are you going', phrases)

The results are sorted by similarity score:

['where are you going', 'where are you going jane']

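The function get_close_matches performs fuzzy string matching. By default it returns at most n = 3 matches whose similarity score is at least cutoff = 0.6.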
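
The similarity score behind get_close_matches is the ratio() of difflib.SequenceMatcher, defined as 2*M/T, where M is the number of matching characters and T is the total length of both strings. A minimal sketch for the two "where are you going" phrases (the variable names are chosen here for illustration):

```python
import difflib

a = "where are you going"        # 19 characters
b = "where are you going jane"   # 24 characters

# ratio() = 2 * M / T; here all 19 characters of a match a prefix of b,
# so M = 19 and T = 19 + 24 = 43
ratio = difflib.SequenceMatcher(None, a, b).ratio()
print(round(ratio, 3))  # 0.884
```

Since 0.884 is above the default cutoff of 0.6, the longer phrase is returned as a close match.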

You can also apply the function to the DataFrame:

d['text_similar'] = d.text.apply(lambda row: difflib.get_close_matches(row, list(d[d.text!=row].text), cutoff = 0.8))

Output:

                                                text                                       text_similar
0  hello, this is a test. we want to remove entri...  [hello, this is a test. we want to remove entr...
1  hello, this is a test. we want to remove entri...  [hello, this is a test. we want to remove entr...
2                                where are you going                         [where are you going jane]
3            i'm going to the zoo to pet the animals                                                 []
4                           where are you going jane                              [where are you going]

In the example above, with cutoff = 0.8, "i'm going to the zoo to pet the animals" has no sufficiently similar string.
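
To actually drop the near-duplicates while keeping the first occurrence, one option is a greedy pass that compares each text only against the texts already kept. This is a sketch, not part of the original answer: the helper name drop_near_duplicates and the cutoff of 0.8 are assumptions, and it works on a plain list rather than on the DataFrame.

```python
import difflib

def drop_near_duplicates(texts, cutoff=0.8):
    """Keep a text only if no previously kept text is a close match."""
    kept = []
    for text in texts:
        # n=1: a single close match is enough to treat the text as a duplicate
        if not difflib.get_close_matches(text, kept, n=1, cutoff=cutoff):
            kept.append(text)
    return kept

phrases = ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
           "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
           "where are you going",
           "i'm going to the zoo to pet the animals",
           "where are you going jane"]

unique = drop_near_duplicates(phrases)
print(unique)  # keeps phrases 1, 3 and 4 of the list above
```

The result could be applied to the DataFrame with something like d[d.text.isin(unique)]. The pass is still quadratic in the worst case (when almost nothing is a duplicate), so on 100,000 rows it may remain slow.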

You can use difflib.SequenceMatcher and filter the rows using a similarity threshold (thr) on each text's similarity ratio to the other texts:

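import difflib

df = d.copy()  # d is the DataFrame from the question

# Threshold filter based on percentage similarity
thr = 0.85
df['Flag'] = 0
for text in df['text'].tolist():
    # similarity of every row to the current text
    df['temp'] = [difflib.SequenceMatcher(None, text1, text).ratio() for text1 in df['text'].tolist()]
    # give all rows similar to the current text the same new group number
    df.loc[df['temp'].gt(thr), ['Flag']] = df['Flag'].max() + 1
df = df.drop(columns='temp')

# keep only the first row of each group
df.loc[~df['Flag'].duplicated(keep='first')]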

Out:

    text                                                 Flag   
0   hello, this is a test. we want to remove entri...   2   
2   where are you going                                 5   
3   i'm going to the zoo to pet the animals             4   
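
Because ratio() is the expensive part of the loop above, one possible speed-up is to check the cheaper upper bounds first. SequenceMatcher provides real_quick_ratio() and quick_ratio(), both documented as upper bounds on ratio(), so a pair can be rejected early when either bound is below the threshold. A sketch (the helper name is_similar is an assumption made for this example):

```python
import difflib

def is_similar(a, b, thr=0.85):
    """True if the SequenceMatcher ratio of a and b is at least thr."""
    sm = difflib.SequenceMatcher(None, a, b)
    # Cheap upper bounds first; `and` short-circuits, so the costly
    # ratio() is computed only for pairs that pass both bounds.
    return (sm.real_quick_ratio() >= thr
            and sm.quick_ratio() >= thr
            and sm.ratio() >= thr)

print(is_similar("where are you going", "where are you going jane"))                 # True
print(is_similar("where are you going", "i'm going to the zoo to pet the animals"))  # False
```

This does not change the number of pairs compared, only the cost of most comparisons.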

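Strictly speaking, this problem should be handled with a clustering model, filtering the texts by their distance to the cluster centers.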

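If you are concerned about reducing the time complexity, you will need to make the problem a bit more complex by applying clustering to one-hot encoded vectors of the texts.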
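
The answer does not name a specific clustering model, so the following is only one possible sketch of the idea, in plain Python. Each text is represented by its set of words (equivalent to a binary, one-hot style word vector), and a greedy "leader" clustering assigns a text to the first existing leader whose Jaccard similarity reaches a threshold. An inverted index from words to leaders limits the comparisons to leaders that share at least one word. The function name leader_cluster and the threshold of 0.7 are assumptions made for this example.

```python
from collections import defaultdict

def leader_cluster(texts, threshold=0.7):
    """Greedy leader clustering on binary word vectors stored as sets."""
    leaders = []                # indices of texts kept as cluster representatives
    leader_words = {}           # leader index -> its set of words
    index = defaultdict(set)    # word -> indices of leaders containing it
    labels = []                 # for each text, the index of its leader
    for i, text in enumerate(texts):
        words = set(text.split())
        # Only leaders sharing at least one word can have non-zero similarity
        candidates = set()
        for w in words:
            candidates |= index.get(w, set())
        best = None
        for c in sorted(candidates):
            other = leader_words[c]
            jaccard = len(words & other) / len(words | other)
            if jaccard >= threshold:
                best = c
                break
        if best is None:
            # No similar leader found: this text starts a new cluster
            leaders.append(i)
            leader_words[i] = words
            for w in words:
                index[w].add(i)
            best = i
        labels.append(best)
    return leaders, labels

phrases = ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
           "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
           "where are you going",
           "i'm going to the zoo to pet the animals",
           "where are you going jane"]

leaders, labels = leader_cluster(phrases)
print(leaders)  # [0, 2, 3]
print(labels)   # [0, 0, 2, 3, 2]
```

Keeping only the leader rows removes the near-duplicates. How much time this saves depends on the data: very common words put many leaders into the candidate set, so filtering out such words first may be necessary.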
