Remove similar character string duplicates from a dataframe

I have a DataFrame, df, which currently looks something like this:

Car Name      Number
Adam Leaf     9
Adamm Leaf    9
Adam Lea      NaN
Adam-Leaf     NaN
Adam/Leaf     9
Claire-Green  NaN
Cliare Green  3
Claire Green  3
Claire Gren   NaN
Claire/Green  3

I am trying to remove the spelling and punctuation variations so that each name appears only once, like this:

Car Name      Number
Adam Leaf     9
Claire Green  3

Here is one way, using the Soundex phonetic code from the jellyfish library:

import jellyfish

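# Group rows whose names share a Soundex code; first() keeps the first non-null value in each column
s = df.groupby(df['Car Name'].apply(jellyfish.soundex)).first()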
              Car Name  Number
Car Name                      
A354         Adam Leaf     9.0
C462      Claire-Green     3.0
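
To see why all the variants end up under the same key, here is a rough plain-Python sketch of the classic American Soundex rules. It is only an illustration, not jellyfish's actual code: the function name soundex_sketch is made up here, and dropping spaces and punctuation before encoding is an assumption that jellyfish may handle differently.

```python
# Letter groups of classic American Soundex; vowels, y, h and w carry no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex_sketch(name: str) -> str:
    """Hypothetical helper: 4-character Soundex code, ignoring non-letters."""
    # Assumption: spaces, hyphens, slashes and other non a-z characters are dropped.
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    last = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two equal codes
        code = CODES.get(ch, "")
        if code and code != last:
            result += code
            if len(result) == 4:
                break
        last = code  # a vowel resets this, so the same code may repeat after it
    return result.ljust(4, "0")  # pad short codes with zeros


names = ["Adam Leaf", "Adamm Leaf", "Adam Lea", "Adam-Leaf", "Adam/Leaf",
         "Claire-Green", "Cliare Green", "Claire Green", "Claire Gren",
         "Claire/Green"]
for n in names:
    print(n, soundex_sketch(n))
```

Because only the first letter and the first three consonant codes survive, doubled letters, swapped vowels and a dropped trailing letter do not change the key. The flip side is that unrelated names with similar consonants can also collide, so it is worth checking the groups before trusting the result.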

This can be solved by calculating the Levenshtein distance between the names or, even better, by using the FuzzyWuzzy library.

https://www.datacamp.com/community/tutorials/fuzzy-string-python
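
As a rough sketch of the Levenshtein idea, the snippet below uses a hand-written edit distance rather than any particular library, so no library API is assumed. The helper names (levenshtein, normalize, dedupe), the separator normalization, and the threshold of 2 edits are all assumptions chosen for this sample data, not part of the original answer.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Lowercase and turn '-', '/' and other separators into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def dedupe(rows, max_distance=2):
    """Greedy pass: a row joins the first kept name within max_distance edits."""
    kept = []  # each entry: [normalized key, first-seen name, number]
    for name, number in rows:
        key = normalize(name)
        for entry in kept:
            if levenshtein(key, entry[0]) <= max_distance:
                # Fill a missing number from a later variant of the same name.
                if entry[2] is None and number is not None:
                    entry[2] = number
                break
        else:
            kept.append([key, name, number])
    return [(name, number) for _, name, number in kept]


rows = [
    ("Adam Leaf", 9), ("Adamm Leaf", 9), ("Adam Lea", None),
    ("Adam-Leaf", None), ("Adam/Leaf", 9),
    ("Claire-Green", None), ("Cliare Green", 3), ("Claire Green", 3),
    ("Claire Gren", None), ("Claire/Green", 3),
]
print(dedupe(rows))
```

The sketch keeps the first spelling it sees for each cluster, so the Claire rows come out under "Claire-Green" rather than "Claire Green"; picking the most common spelling instead would need an extra counting step. The threshold matters too: too small and transposed letters such as "Cliare" split off, too large and genuinely different short names start to merge. With a DataFrame, the same cluster labels could be used as the grouping key in the groupby(...).first() pattern shown above.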
