How to remove similar strings as if they were duplicates from a dataframe?
I have a df that currently looks like this:
Car Name      Number
Adam Leaf     9
Adamm Leaf    9
Adam Lea      NaN
Adam-Leaf     NaN
Adam/Leaf     9
Claire-Green  NaN
Cliare Green  3
Claire Green  3
Claire Gren   NaN
Claire/Green  3
I am trying to remove the variations so that I end up with something like this:
Car Name      Number
Adam Leaf     9
Claire Green  3
Here is one approach using jellyfish:
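import jellyfish
# Group rows by the Soundex code of each name, so similar-sounding names share a key,
# then keep the first non-null value of each column within every group.
s = df.groupby(df['Car Name'].apply(jellyfish.soundex)).first()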
              Car Name  Number
Car Name
A354         Adam Leaf     9.0
C462      Claire-Green     3.0
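Note that the result above keeps "Claire-Green" rather than "Claire Green", because first() takes the first non-null value per column in each group. One way to get closer to the desired output is to normalize the separators before grouping. The following is only a minimal sketch under stated assumptions: it treats - and / as spaces, and to stay dependency-free it uses plain Python lists and a hand-written, simplified Soundex (the hypothetical helper simple_soundex) instead of pandas and jellyfish.soundex, which may behave differently on edge cases.

```python
import re

# Letter-to-digit table used by Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def simple_soundex(text):
    """Simplified Soundex (hypothetical helper): first letter plus three digits."""
    s = text.upper()
    if not s:
        return ""
    result = s[0]
    last = CODES.get(s[0])
    for ch in s[1:]:
        code = CODES.get(ch)
        if code is not None:
            if code != last:          # skip repeated codes such as "MM"
                result += code
            last = code
        elif ch not in "HW":          # vowels and separators reset the last code
            last = None
        if len(result) == 4:
            break
    return result.ljust(4, "0")


# Same data as the question, as (name, number) pairs; None stands in for NaN.
rows = [
    ("Adam Leaf", 9), ("Adamm Leaf", 9), ("Adam Lea", None),
    ("Adam-Leaf", None), ("Adam/Leaf", 9), ("Claire-Green", None),
    ("Cliare Green", 3), ("Claire Green", 3), ("Claire Gren", None),
    ("Claire/Green", 3),
]

groups = {}
for name, number in rows:
    clean = re.sub(r"[-/]", " ", name)            # assumption: - and / mean a space
    key = simple_soundex(clean)
    entry = groups.setdefault(key, {"name": clean, "number": number})
    if entry["number"] is None:                   # fill in the first non-missing number
        entry["number"] = number

result = [(g["name"], g["number"]) for g in groups.values()]
print(result)  # [('Adam Leaf', 9), ('Claire Green', 3)]
```

The kept name is still just the first spelling seen in each group, so a misspelled first row (for example "Cliare Green") would survive; picking the most frequent spelling per group would be a more robust choice.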
This can also be solved by computing the Levenshtein distance or, even better, by using the FuzzyWuzzy library:
https://www.datacamp.com/community/tutorials/fuzzy-string-python
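As a rough illustration of the Levenshtein idea, here is a minimal sketch that does not use FuzzyWuzzy at all: it implements the edit distance by hand and then makes one greedy pass over the rows. The helper names levenshtein and dedupe, the separator normalization, and the threshold max_distance=2 are all assumptions chosen for this example data, not part of any library.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def dedupe(rows, max_distance=2):
    """Greedy pass: each row joins the first kept name within max_distance edits."""
    kept = []  # list of [name, number]
    for name, number in rows:
        # Assumption: - and / are separators equivalent to a space.
        clean = name.replace("-", " ").replace("/", " ")
        for entry in kept:
            if levenshtein(clean.lower(), entry[0].lower()) <= max_distance:
                if entry[1] is None:   # fill in the first non-missing number
                    entry[1] = number
                break
        else:
            kept.append([clean, number])
    return [tuple(e) for e in kept]


# Same data as the question, as (name, number) pairs; None stands in for NaN.
rows = [
    ("Adam Leaf", 9), ("Adamm Leaf", 9), ("Adam Lea", None),
    ("Adam-Leaf", None), ("Adam/Leaf", 9), ("Claire-Green", None),
    ("Cliare Green", 3), ("Claire Green", 3), ("Claire Gren", None),
    ("Claire/Green", 3),
]

print(dedupe(rows))  # [('Adam Leaf', 9), ('Claire Green', 3)]
```

The greedy pass depends on row order and on the threshold: without the separator normalization, "Cliare Green" is three edits away from "Claire-Green" and would not be merged at max_distance=2. A library scorer such as the ones in FuzzyWuzzy returns a similarity ratio instead of a raw distance, so the threshold would need to be chosen differently there.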