How to delete exact duplicates within a cell of a column in a CSV using Python pandas
I have a CSV file that looks like this:
venue | innings | bowler |
---|---|---|
a | 1 | p kumarp kumarp kumarz khanz khan |
a | 2 | AB DindaAB DindaSM PollockSM Pollock,JDP Oram |
b | 1 | A NehraA NehraA NehraASM Pollock |
b | 2 | B LeeB LeeB LeeSR WatsonSR Watson |
c | 1 | SM PollockSM PollockAB DindaAB Dinda |
Desired output:
venue | innings | bowler | no of bowlers |
---|---|---|---|
a | 1 | p kumar,z khan | 2 |
a | 2 | AB Dinda,SM Pollock,JDP Oram | 3 |
b | 1 | A Nehra,SM Pollock | 2 |
b | 2 | B Lee,SR Watson | 2 |
c | 1 | SM Pollock,AB Dinda | 2 |
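Also, "AB Dinda,SM Pollock" and "SM Pollock,AB Dinda" should be treated as the same combination when we create dummy columns.
The code I used:
df.drop_duplicates(subset="bowler", keep=False, inplace=True)
I know my code is not correct.
Suppose you have this dataframe: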
venue innings bowler
0 a 1 p kumarp kumarp kumarz khanz khan
1 a 2 AB DindaAB DindaSM PollockSM Pollock,JDP Oram
2 b 1 A NehraA NehraA NehraASM Pollock
3 b 2 B LeeB LeeB LeeSR WatsonSR Watson
4 c 1 SM PollockSM PollockAB DindaAB Dinda
Then you can use a regular expression to try to clean up your data. For example:
import re

def clean(x):
    # Capture any run of 5+ characters that is immediately repeated.
    # 5 is the minimum length of a name; you can tweak this value.
    m = re.findall(r"(.{5,})\1", x)
    for name in m:
        # Remove every copy of the repeated name so only unrepeated names remain.
        x = x.replace(name, "").strip(",")
    return ",".join(m + [x]).strip(",")

df.bowler = df.bowler.apply(clean)
df["no of bowlers"] = df.bowler.apply(lambda x: len(x.split(",")))
print(df)
Prints:
venue innings bowler no of bowlers
0 a 1 p kumar,z khan 2
1 a 2 AB Dinda,SM Pollock,JDP Oram 3
2 b 1 A Nehra,ASM Pollock 2
3 b 2 B Lee,SR Watson 2
4 c 1 SM Pollock,AB Dinda 2
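To see what the backreference pattern does on its own, here is a minimal sketch run against one of the sample strings from the question. The variable names `sample` and `repeated` are only illustrative. The group `(.{5,})` captures a run of at least five characters, and `\1` requires the same run to follow immediately, so only names that appear back to back are captured; the unrepeated `JDP Oram` is left for the later `replace` step to pick up.

```python
import re

# One of the raw cells from the question.
sample = "AB DindaAB DindaSM PollockSM Pollock,JDP Oram"

# (.{5,}) captures 5 or more characters; \1 demands an immediate repeat of that capture.
repeated = re.findall(r"(.{5,})\1", sample)
print(repeated)  # ['AB Dinda', 'SM Pollock']
```

Names shorter than the minimum length would not be captured, so the threshold probably needs adjusting for real data.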
You will need regex to extract the duplicates, using a custom function that you apply to the column:
import re
import pandas as pd

df = pd.read_csv('filename.csv')

def clean_bowler(text):
    # Extract repeated substrings and drop short repeats inside names (e.g. doubled letters); the threshold of 2 can be changed.
    duplicates = [i for i in re.findall(r'(.+?)\1+', text) if len(i) > 2]
    # Split on the (escaped) duplicates to find the other names; with no duplicates, keep the text whole.
    # Note: there is no way to tell consecutive unique names apart.
    parts = re.split('|'.join(map(re.escape, duplicates)), text) if duplicates else [text]
    other_words = [re.sub(r'[^a-zA-Z\d\s:]', '', i) for i in parts if i]
    return duplicates + other_words

df['bowler'] = df['bowler'].apply(clean_bowler)
df['no.bowler'] = df['bowler'].apply(len)
Output:
 | venue | innings | bowler | no.bowler |
---|---|---|---|---|
0 | a | 1 | ['p kumar', 'z khan'] | 2 |
1 | a | 2 | ['AB Dinda', 'SM Pollock', 'JDP Oram'] | 3 |
2 | b | 1 | ['A Nehra', 'ASM Pollock'] | 2 |
3 | b | 2 | ['B Lee', 'SR Watson'] | 2 |
4 | c | 1 | ['SM Pollock', 'AB Dinda'] | 2 |
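For the dummy-column part of the question, one possible approach (not taken from either answer) is `Series.str.get_dummies`, which builds one 0/1 column per distinct name. The sketch below assumes the `bowler` column holds comma-joined strings, as produced by the first answer; the small dataframe is hypothetical sample data. Because each name gets its own column, the order of names within a cell does not affect the result.

```python
import pandas as pd

# Hypothetical cleaned data: same two bowlers, listed in a different order.
df = pd.DataFrame({
    "venue": ["a", "c"],
    "innings": [2, 1],
    "bowler": ["AB Dinda,SM Pollock", "SM Pollock,AB Dinda"],
})

# One indicator column per distinct name, split on the comma separator.
dummies = df["bowler"].str.get_dummies(sep=",")
print(dummies)

# Both rows get identical indicator values despite the different name order.
same = dummies.iloc[0].tolist() == dummies.iloc[1].tolist()
print(same)
```

If the column holds lists instead (as in the second answer), joining each list with `",".join` first should make the same call applicable.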