Removing similar rows from a CSV file
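I want to remove similar rows from a CSV file. Is there a function that compares strings and discards a row when it matches another one by 80% or more?
Input data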
Identifier StructureData
0 Entry ID Structure Title
1 5O47 Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with fructosyl-hydroxytyrosol
2 5FK8 Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with Neo-erlose
3 5FK7 Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with neokestose
4 5FMD Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with nystose
5 5FKC Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with Raffinose
6 5FIX Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with sucrose
7 6S82 Structure Of D80A-Fructofuranosidase From Xanthophyllomyces Dendrorhous Complexed With Tris-buffer molecule And hydroquinone
8 7ALJ Structure of Drosophila Notch EGF domains 11-13
9 6YX5 Structure of DrrA from Legionella pneumophilia in complex with human Rab8a
10 3U75 Structure of E230A-fructofuranosidase from Schwanniomyces occidentalis complexed with fructosylnystose
11 5IFT STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 3-b-Galactopyranosyl glucose
12 5MGC STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 4-Galactosyl-lactose
13 5JUV STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 6-b-Galactopyranosyl galactose
14 5MGD STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 6-Galactosyl-lactose
15 5IHR STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH ALLOLACTOSE
Desired output
Identifier StructureData
0 Entry ID Structure Title
1 5O47 Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with fructosyl-hydroxytyrosol
2 7ALJ Structure of Drosophila Notch EGF domains 11-13
3 6YX5 Structure of DrrA from Legionella pneumophilia in complex with human Rab8a
4 3U75 Structure of E230A-fructofuranosidase from Schwanniomyces occidentalis complexed with fructosylnystose
5 5IFT STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 3-b-Galactopyranosyl glucose
There are several string similarity measures. This one is built into Python's standard library (difflib):
from difflib import SequenceMatcher

def measure_similarity(x, y):
    # ratio() returns a float between 0.0 (nothing in common) and 1.0 (identical)
    return SequenceMatcher(None, x, y).ratio()

s1 = '5FMD Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with nystose'
s2 = '5FIX Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with sucrose'
sim = measure_similarity(s1, s2)
print(sim)
Result:
0.9528301886792453
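To show how a similarity measure like this can be turned into a row filter, here is a minimal sketch. The function name `drop_similar` and the default threshold of 0.8 are illustrative assumptions, not part of the answer above. It keeps a row only when it is less than 80% similar to every row already kept, so the first row of each similar group survives.

```python
from difflib import SequenceMatcher


def drop_similar(rows, threshold=0.8):
    """Keep a row only if its similarity to every kept row is below threshold.

    Hypothetical helper: threshold is a ratio in [0.0, 1.0], as returned by
    SequenceMatcher.ratio().
    """
    kept = []
    for row in rows:
        if all(SequenceMatcher(None, row, k).ratio() < threshold for k in kept):
            kept.append(row)
    return kept


rows = [
    '5FMD Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with nystose',
    '5FIX Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with sucrose',
    '7ALJ Structure of Drosophila Notch EGF domains 11-13',
]
for row in drop_similar(rows):
    print(row)
```

With these three rows, the 5FIX row is dropped because it is about 95% similar to the 5FMD row, while the much shorter 7ALJ row is kept. Each row is compared against every kept row, so the running time grows roughly quadratically with the number of distinct rows.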
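What you are looking for is called fuzzy matching. There are many libraries for this, and FuzzyWuzzy is a good one. See this article for a good overview...
Note that the code below does not do exactly what you expect: it keeps row 7 of the data (6S82), because that row's similarity score against every row already kept is below 80. The row is written in title case and has a longer ending than the other D80A rows.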
from fuzzywuzzy import fuzz
data = """Identifier StructureData
Entry ID Structure Title
5O47 Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with fructosyl-hydroxytyrosol
5FK8 Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with Neo-erlose
5FK7 Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with neokestose
5FMD Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with nystose
5FKC Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with Raffinose
5FIX Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with sucrose
6S82 Structure Of D80A-Fructofuranosidase From Xanthophyllomyces Dendrorhous Complexed With Tris-buffer molecule And hydroquinone
7ALJ Structure of Drosophila Notch EGF domains 11-13
6YX5 Structure of DrrA from Legionella pneumophilia in complex with human Rab8a
3U75 Structure of E230A-fructofuranosidase from Schwanniomyces occidentalis complexed with fructosylnystose
5IFT STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 3-b-Galactopyranosyl glucose
5MGC STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 4-Galactosyl-lactose
5JUV STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 6-b-Galactopyranosyl galactose
5MGD STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 6-Galactosyl-lactose
5IHR STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH ALLOLACTOSE"""
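result = []
for line_data in data.splitlines():
    # Highest similarity score (0-100) between this line and any line already kept
    maxvalue = 0
    for line_result in result:
        maxvalue = max(maxvalue, fuzz.ratio(line_data, line_result))
    # Keep the line only if it is less than 80% similar to every kept line
    if maxvalue < 80:
        result.append(line_data)

for line in result:
    print(line)
Output: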
Identifier StructureData
Entry ID Structure Title
5O47 Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with fructosyl-hydroxytyrosol
6S82 Structure Of D80A-Fructofuranosidase From Xanthophyllomyces Dendrorhous Complexed With Tris-buffer molecule And hydroquinone
7ALJ Structure of Drosophila Notch EGF domains 11-13
6YX5 Structure of DrrA from Legionella pneumophilia in complex with human Rab8a
3U75 Structure of E230A-fructofuranosidase from Schwanniomyces occidentalis complexed with fructosylnystose
5IFT STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 3-b-Galactopyranosyl glucose
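The question is about a CSV file, and the row that slips through differs from its neighbours partly in letter case. One possible variation, sketched here with the standard library only (`difflib` instead of FuzzyWuzzy, so the threshold is 0.8 rather than 80), compares just the title column and ignores case. Everything about the file layout is an assumption: comma-delimited, one header row, the title in the second column, and every data row having that column. The name `dedupe_csv` and its parameters are hypothetical. Whether this also removes the 6S82 row at a threshold of 0.8 has not been checked here.

```python
import csv
import io
from difflib import SequenceMatcher


def dedupe_csv(src, dst, column=1, threshold=0.8, header_rows=1):
    """Copy rows from src to dst, skipping rows whose `column` text is at
    least `threshold` similar (ignoring case) to an already kept row."""
    writer = csv.writer(dst)
    kept_keys = []
    for i, row in enumerate(csv.reader(src)):
        if i < header_rows:
            writer.writerow(row)  # pass header rows through unchanged
            continue
        key = row[column].casefold()  # compare case-insensitively
        if all(SequenceMatcher(None, key, k).ratio() < threshold for k in kept_keys):
            kept_keys.append(key)
            writer.writerow(row)


# Made-up sample data, only to show the call
src = io.StringIO(
    "Entry ID,Structure Title\n"
    "A1,Structure of foo bar baz\n"
    "A2,STRUCTURE OF FOO BAR BAZ\n"
    "A3,Something else entirely\n"
)
dst = io.StringIO()
dedupe_csv(src, dst)
print(dst.getvalue())
```

With real files, pass objects opened with `open(path, newline="")` in place of the `io.StringIO` buffers, as the `csv` module documentation recommends. Comparing only the title column keeps the differing Entry ID from lowering the score of otherwise identical titles.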