
How to delete duplicates from a cell of a column in a CSV

I have a CSV file that looks like this:

venue  innings  bowler
a      1        p kumarp kumarp kumarz khanz khan
a      2        AB DindaAB DindaSM PollockSM Pollock,JDP Oram
b      1        A NehraA NehraA NehraASM Pollock
b      2        B LeeB LeeB LeeSR WatsonSR Watson
c      1        SM PollockSM PollockAB DindaAB Dinda

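Desired output:

venue  innings  bowler                        no of bowlers
a      1        p kumar,z khan                2
a      2        AB Dinda,SM Pollock           3
b      1        A Nehra,SM Pollock,,JDP Oram  2
b      2        B Lee,SR Watson               2
c      1        SM Pollock,AB Dinda           2

Also, "AB Dinda,SM Pollock" and "SM Pollock,AB Dinda" should be treated as the same value when we create dummy columns.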

The code I used:

drop_duplicates(subset ="bowler",
                 keep = False, inplace = True)

I know my code is incorrect.

Suppose you have this dataframe:

  venue  innings                                         bowler
0     a        1              p kumarp kumarp kumarz khanz khan
1     a        2  AB DindaAB DindaSM PollockSM Pollock,JDP Oram
2     b        1               A NehraA NehraA NehraASM Pollock
3     b        2              B LeeB LeeB LeeSR WatsonSR Watson
4     c        1           SM PollockSM PollockAB DindaAB Dinda

Then you can use a regular expression to try to clean up your data. For example:

import re


def clean(x):
    m = re.findall(r"(.{5,})\1", x)  # 5 is the minimum length of a name; you can tweak this value
    for name in m:
        x = x.replace(name, "").strip(",")
    return ",".join(m + [x]).strip(",")


df.bowler = df.bowler.apply(clean)
df["no of bowlers"] = df.bowler.apply(lambda x: len(x.split(",")))
print(df)

Prints:

  venue  innings                        bowler  no of bowlers
0     a        1                p kumar,z khan              2
1     a        2  AB Dinda,SM Pollock,JDP Oram              3
2     b        1           A Nehra,ASM Pollock              2
3     b        2               B Lee,SR Watson              2
4     c        1           SM Pollock,AB Dinda              2
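
The question also asks that "AB Dinda,SM Pollock" and "SM Pollock,AB Dinda" count as the same value. One possible way to get there, sketched below, is to sort the names inside each cleaned cell so that both orderings collapse to one string. The helper name `canonical` is hypothetical and not part of the answer above; it assumes the cell has already been cleaned into a comma-separated string.

```python
def canonical(bowlers: str) -> str:
    """Return the names of a cleaned, comma-separated cell in sorted order.

    Hypothetical helper: empty entries are skipped and any repeated
    name is kept only once.
    """
    names = {name.strip() for name in bowlers.split(",") if name.strip()}
    return ",".join(sorted(names))


# Both orderings collapse to the same string.
print(canonical("SM Pollock,AB Dinda"))
print(canonical("AB Dinda,SM Pollock"))
```

Applied to the column with something like `df.bowler.apply(canonical)`, this should give identical strings for the same set of bowlers, so a later `pd.get_dummies(df["bowler"])` would place them in the same dummy column.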

You will need a regex inside a custom function, applied to the column, to extract the duplicates:

import re

import pandas as pd

df = pd.read_csv('filename.csv')

def clean_bowler(text):
    # Extract repeated substrings and drop short repeats that occur inside a
    # name (e.g. doubled letters); the threshold of 2 can be changed.
    duplicates = [i for i in re.findall(r'(.+?)\1+', text) if len(i) > 2]
    # Use the duplicates as delimiters to pick up the remaining names.
    # Names are escaped so they are matched literally; with no duplicates the
    # cell is kept whole instead of being split on an empty pattern.
    # Note: consecutive unique names cannot be told apart this way.
    parts = re.split('|'.join(map(re.escape, duplicates)), text) if duplicates else [text]
    other_words = [re.sub(r'[^a-zA-Z\d\s:]', '', i) for i in parts if i]
    return duplicates + other_words

df['bowler'] = df['bowler'].apply(clean_bowler)
df['no.bowler'] = df['bowler'].apply(len)

Output:

  venue  innings                                  bowler  no.bowler
0     a        1                  ['p kumar', 'z khan']          2
1     a        2  ['AB Dinda', 'SM Pollock', 'JDP Oram']          3
2     b        1             ['A Nehra', 'ASM Pollock']          2
3     b        2                 ['B Lee', 'SR Watson']          2
4     c        1             ['SM Pollock', 'AB Dinda']          2
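
To see why the length filter in `clean_bowler` matters, the short sketch below runs the same pattern on the third row of the sample data. It uses only the standard `re` module; the variable names `raw` and `kept` are illustrative.

```python
import re

text = "A NehraA NehraA NehraASM Pollock"

# Every substring that is immediately repeated at least once.
raw = re.findall(r"(.+?)\1+", text)

# Keep only repeats longer than 2 characters, as in clean_bowler.
kept = [s for s in raw if len(s) > 2]

print(raw)   # ['A Nehra', 'l']
print(kept)  # ['A Nehra']
```

The stray `'l'` comes from the double "l" in "Pollock". Without the threshold it would be treated as a repeated name and used as a delimiter when splitting the rest of the cell.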

