Remove similar character string duplicates from a dataframe
I have a dataframe df which currently looks something like this:
Car Name      Number
Adam Leaf     9
Adamm Leaf    9
Adam Lea      NaN
Adam-Leaf     NaN
Adam/Leaf     9
Claire-Green  NaN
Cliare Green  3
Claire Green  3
Claire Gren   NaN
Claire/Green  3
I am trying to remove the variations to achieve something like this:
Car Name      Number
Adam Leaf     9
Claire Green  3
Here is one way, using jellyfish: group the rows by the Soundex code of each name and keep the first non-null value per column.
import jellyfish
s=df.groupby(df['Car Name'].apply(jellyfish.soundex)).first()
              Car Name  Number
Car Name
A354         Adam Leaf     9.0
C462      Claire-Green     3.0
This can be solved by calculating the Levenshtein distance or, even better, by using the FuzzyWuzzy library:
https://www.datacamp.com/community/tutorials/fuzzy-string-python
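As a rough illustration of the Levenshtein idea, here is a minimal standard-library sketch. It does not use FuzzyWuzzy; the helper names (levenshtein, normalise, canonical_names) and the threshold max_distance=2 are assumptions made up for this example, and the greedy grouping simply keeps the first spelling it sees as the representative of each group.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalise(name: str) -> str:
    """Lower-case and turn separators such as '-' or '/' into spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def canonical_names(names, max_distance=2):
    """Map each name to the first earlier name within max_distance edits.

    max_distance is an assumed threshold; tune it for your own data.
    """
    representatives = []
    mapping = {}
    for name in names:
        for rep in representatives:
            if levenshtein(normalise(name), normalise(rep)) <= max_distance:
                mapping[name] = rep
                break
        else:
            # No close match found: this spelling starts a new group.
            representatives.append(name)
            mapping[name] = name
    return mapping


names = ["Adam Leaf", "Adamm Leaf", "Adam Lea", "Adam-Leaf", "Adam/Leaf",
         "Claire-Green", "Cliare Green", "Claire Green", "Claire Gren",
         "Claire/Green"]
mapping = canonical_names(names)
```

With the dataframe from the question, the mapping could then be used much like the Soundex codes above, for example df.groupby(df['Car Name'].map(mapping)).first(). Because the first spelling encountered becomes the group label, the second group is labelled Claire-Green rather than Claire Green.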