Removing repeated commas from a Pandas DataFrame column, keeping only the text separated by single commas

I have this dataframe with the Text column:

Text                                   Cleaned Col
, , , Apples, , , Hard Work, ,         Apples, Hard Work
, , , , , , , , Apples, , , , ,        Apples
Apples, , Watermelon, , , , , ,        Apples, Watermelon
, , , , , , , , , , , , , , , , , ,

I would like to create a column such as Cleaned Col, essentially using regex.

I looked at different patterns such as r'\s*,*([^(a-zA-Z)]*)' but I am not getting the right outcome.

Use Series.str.findall to get the words and join them with a comma:

df['Cleaned Col'] = df['Text'].str.findall(r'\w+').str.join(', ')
print (df)
                                      Text         Cleaned Col
0      , , , Apples , , , Bananas , , ,        Apples, Bananas
1    , , , , , , , , Apples , , , , ,                   Apples
2        Apples , , Watermelon , , , , , ,  Apples, Watermelon
3  , , , , , , , , , , , , , , , , ,                          
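One thing worth checking with this approach is how it treats multi-word fields. The sketch below rebuilds the question's sample data (the exact comma counts are an assumption) and shows that \w+ matches each word separately, so Hard Work comes back as two items:

```python
import pandas as pd

# Sample rows modelled on the question; the exact comma counts are assumed.
df = pd.DataFrame({"Text": [
    ", , , Apples, , , Hard Work, , ",
    ", , , , , , , , Apples, , , , , ",
    "Apples, , Watermelon, , , , , ,",
    ", , , , , , , , , , , , , , , , , ,",
]})

# \w+ matches runs of word characters, so whitespace inside a field splits it.
words = df["Text"].str.findall(r"\w+").str.join(", ")
print(words.tolist())
# ['Apples, Hard, Work', 'Apples', 'Apples, Watermelon', '']
```

If the fields never contain whitespace this is the shortest option; otherwise a pattern that allows inner spaces is needed.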

You could also replace the commas with spaces, split on whitespace (which drops the leading, trailing, and repeated spaces), and join the pieces with a comma. Note that this also splits multi-word fields such as Hard Work:

df['Cleaned Col'] = df['Text'].apply(lambda x: ', '.join(x.replace(',', ' ').split()))

Since your fields are comma-delimited, you can use

# If the fields CANNOT contain whitespace:
df['Cleaned Col'] = df['Text'].str.findall(r'[^\s,]+').str.join(', ')

# If the fields can contain whitespace:
df['Cleaned Col'] = df['Text'].str.findall(r'[^\s,](?:[^,]*[^\s,])?').str.join(', ')

The regex extracts all matches, and .str.join(', ') joins the resulting list items into a single string. The two patterns break down as follows:

  • [^\s,]+ - one or more chars other than whitespace and comma
  • [^\s,] - a single char other than whitespace and comma
  • (?:[^,]*[^\s,])? - an optional occurrence of zero or more chars other than a comma, followed by a char other than whitespace and comma.
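To see the difference between the two patterns without pandas, here is a minimal sketch using the standard re module. The names NO_SPACES, WITH_SPACES, and clean are hypothetical helpers introduced only for this illustration:

```python
import re

# Hypothetical names, used only for this illustration.
NO_SPACES = re.compile(r"[^\s,]+")
WITH_SPACES = re.compile(r"[^\s,](?:[^,]*[^\s,])?")

def clean(text, pattern):
    """Join every field matched by `pattern` with ', '."""
    return ", ".join(pattern.findall(text))

sample = ", , , Apples, , , Hard Work, , "
print(clean(sample, NO_SPACES))    # Apples, Hard, Work
print(clean(sample, WITH_SPACES))  # Apples, Hard Work
```

The second pattern keeps inner whitespace because [^,]* may run across spaces, while the final [^\s,] forces the match to end on a non-space character, so trailing padding is trimmed.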

If your commas are padded with spaces and you really want to use Series.str.replace, you could use

df['Cleaned Col'] = df['Text'].str.replace(r'^[\s,]+|[\s,]+$|(\s)*(,)[\s,]*', r'\2\1', regex=True)

Details:

  • ^[\s,]+ - one or more whitespaces or commas at the start of the string
  • [\s,]+$ - one or more whitespaces or commas at the end of the string
  • (\s)*(,)[\s,]* - zero or more whitespaces (the last one matched is kept in Group 1, \1), then a comma (captured into Group 2, \2), and then zero or more whitespace or comma chars.

The replacement is the Group 2 value followed by the Group 1 value. When the first or second alternative matches, both groups are empty, so the leading or trailing run is simply removed.
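As a quick check of the replacement approach, the sketch below applies the same pattern to a few invented rows. The last row has no space before the comma, which is an assumed edge case added to show why the padding matters: Group 1 is then empty and the result has no space after the comma.

```python
import pandas as pd

# Invented sample rows; the last one has commas that are NOT padded on the left.
s = pd.Series([
    ", , , Apples , , , Bananas , , , ",
    "Apples , , Watermelon , , , , , ,",
    ", , , , ,",
    "Apples, , Watermelon",
])

pattern = r"^[\s,]+|[\s,]+$|(\s)*(,)[\s,]*"
cleaned = s.str.replace(pattern, r"\2\1", regex=True)
print(cleaned.tolist())
# ['Apples, Bananas', 'Apples, Watermelon', '', 'Apples,Watermelon']
```

If the padding is not guaranteed, the findall-and-join variant is likely the safer choice, since it always inserts the same ', ' separator.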
