Removing similar rows from a csv file

I want to remove similar rows from a csv file. Is there a function that can compare strings and discard them when they match by 80% or more?

Input data

    Identifier  StructureData
0   Entry ID    Structure Title
1   5O47        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with fructosyl-hydroxytyrosol
2   5FK8        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with Neo-erlose
3   5FK7        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with neokestose
4   5FMD        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with nystose
5   5FKC        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with Raffinose
6   5FIX        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with sucrose
7   6S82        Structure Of D80A-Fructofuranosidase From Xanthophyllomyces Dendrorhous Complexed With Tris-buffer molecule And hydroquinone
8   7ALJ        Structure of Drosophila Notch EGF domains 11-13
9   6YX5        Structure of DrrA from Legionella pneumophilia in complex with human Rab8a
10  3U75        Structure of E230A-fructofuranosidase from Schwanniomyces occidentalis complexed with fructosylnystose
11  5IFT        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 3-b-Galactopyranosyl glucose
12  5MGC        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 4-Galactosyl-lactose
13  5JUV        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 6-b-Galactopyranosyl galactose
14  5MGD        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 6-Galactosyl-lactose
15  5IHR        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH ALLOLACTOSE

Output data

    Identifier  StructureData
0   Entry ID    Structure Title
1   5O47        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with fructosyl-hydroxytyrosol
2   7ALJ        Structure of Drosophila Notch EGF domains 11-13
3   6YX5        Structure of DrrA from Legionella pneumophilia in complex with human Rab8a
4   3U75        Structure of E230A-fructofuranosidase from Schwanniomyces occidentalis complexed with fructosylnystose
5   5IFT        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 3-b-Galactopyranosyl glucose

There are several string similarity measures; this one is built in:

from difflib import SequenceMatcher

def measure_similarity(x, y):
    # ratio() returns a float in [0, 1]; 1.0 means the strings are identical.
    return SequenceMatcher(None, x, y).ratio()

s1 = '5FMD        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with nystose'
s2 = '5FIX        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with sucrose'
sim = measure_similarity(s1, s2)

print(sim)

Result:

0.9528301886792453
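
To go from a single pairwise score to filtering rows, one option is a greedy pass that keeps a row only if it scores below the threshold against every row already kept. The following is a minimal sketch of that idea, not a definitive implementation; the function name `drop_similar` and the 0.8 default are assumptions chosen to match the 80% figure in the question.

```python
from difflib import SequenceMatcher


def drop_similar(lines, threshold=0.8):
    """Keep a line only if it is less than `threshold` similar to every kept line.

    `drop_similar` is a hypothetical helper name; the threshold is on the
    0..1 scale returned by SequenceMatcher.ratio().
    """
    kept = []
    for line in lines:
        is_duplicate = any(
            SequenceMatcher(None, line, other).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(line)
    return kept


rows = [
    "5FMD        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with nystose",
    "5FIX        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with sucrose",
    "7ALJ        Structure of Drosophila Notch EGF domains 11-13",
]

# The second row is a near-duplicate of the first, so only 5FMD and 7ALJ are printed.
for row in drop_similar(rows):
    print(row)
```

Each row is compared with every kept row, so the cost grows roughly quadratically with the number of distinct rows; that is fine for a few thousand lines but may be slow for very large files.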

What you are looking for is called fuzzy matching. There are many libraries for this, but FuzzyWuzzy is a good one. See this article for a good overview...

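Note that the code below does not produce exactly what you expect: it keeps row 7 of the data (6S82), because that row scores below 80 against every row kept before it.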

from fuzzywuzzy import fuzz


data = """Identifier  StructureData
Entry ID    Structure Title
5O47        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with fructosyl-hydroxytyrosol
5FK8        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with Neo-erlose
5FK7        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with neokestose
5FMD        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with nystose
5FKC        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with Raffinose
5FIX        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with sucrose
6S82        Structure Of D80A-Fructofuranosidase From Xanthophyllomyces Dendrorhous Complexed With Tris-buffer molecule And hydroquinone
7ALJ        Structure of Drosophila Notch EGF domains 11-13
6YX5        Structure of DrrA from Legionella pneumophilia in complex with human Rab8a
3U75        Structure of E230A-fructofuranosidase from Schwanniomyces occidentalis complexed with fructosylnystose
5IFT        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 3-b-Galactopyranosyl glucose
5MGC        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 4-Galactosyl-lactose
5JUV        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 6-b-Galactopyranosyl galactose
5MGD        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 6-Galactosyl-lactose
5IHR        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH ALLOLACTOSE"""

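# Greedily keep a line only if it scores below 80 against every line kept so far.
result = []
for line_data in data.splitlines():
    maxvalue = 0
    for line_result in result:
        # fuzz.ratio returns a similarity score from 0 to 100.
        maxvalue = max(maxvalue, fuzz.ratio(line_data, line_result))
    if maxvalue < 80:
        result.append(line_data)


for line in result:
    print(line)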

Output

Identifier  StructureData
Entry ID    Structure Title
5O47        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with fructosyl-hydroxytyrosol
6S82        Structure Of D80A-Fructofuranosidase From Xanthophyllomyces Dendrorhous Complexed With Tris-buffer molecule And hydroquinone
7ALJ        Structure of Drosophila Notch EGF domains 11-13
6YX5        Structure of DrrA from Legionella pneumophilia in complex with human Rab8a
3U75        Structure of E230A-fructofuranosidase from Schwanniomyces occidentalis complexed with fructosylnystose
5IFT        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 3-b-Galactopyranosyl glucose
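
Row 6S82 differs from the other D80A rows partly in letter case, and the comparison above also includes the identifier column, which is different on every row. One possible refinement is to normalise each line before scoring it. The sketch below is only an illustration under stated assumptions: it swaps in the standard-library `difflib` for `fuzz.ratio` to stay dependency-free (so the threshold is 0.8 rather than 80), and `title_key` and `drop_similar_titles` are hypothetical helper names.

```python
from difflib import SequenceMatcher


def title_key(line):
    """Drop the leading identifier and lowercase the rest (hypothetical helper)."""
    parts = line.split(None, 1)  # split once on the first run of whitespace
    title = parts[1] if len(parts) == 2 else line
    return title.lower()


def drop_similar_titles(lines, threshold=0.8):
    """Greedy filter that compares normalised titles instead of raw lines."""
    kept, kept_keys = [], []
    for line in lines:
        key = title_key(line)
        # Keep the line only if its key is below the threshold against all kept keys.
        if all(SequenceMatcher(None, key, k).ratio() < threshold for k in kept_keys):
            kept.append(line)
            kept_keys.append(key)
    return kept


sample = [
    "1AAA  Structure of X complexed with sucrose",
    "2BBB  STRUCTURE OF X COMPLEXED WITH SUCROSE",
    "3CCC  Something else entirely",
]

# 2BBB has the same title as 1AAA once case is ignored, so it is dropped.
for row in drop_similar_titles(sample):
    print(row)
```

Whether a particular row such as 6S82 then crosses the threshold depends on the scorer and the exact strings, so it is worth checking the result against the real data before relying on it.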

