Remove duplicate commas from a Pandas DataFrame column (in other words, I only need the text in the column, separated by commas)

Question

I have this dataframe with the Text column:

Text	Cleaned Col
, , , Apples, , , Hard Work, , , , , 	Apples, Hard Work
, , , , , , , , Apples, , , , , 	Apples
Apples, , Watermelon, , , , , ,	Apples, Watermelon
, , , , , , , , , , , , , , , , , ,	

I would like to create a column such as Cleaned Col, ideally using regex.

I looked at different patterns such as r'\s*,*([^(a-zA-Z)]*)' but I am not getting the right outcome.

Answer 1

Use Series.str.findall to get the words and join them with a comma (\w+ matches individual words, so a multi-word field such as Hard Work would come out as Hard, Work):

df['Cleaned Col'] = df['Text'].str.findall(r'\w+').str.join(', ')
print(df)
                                      Text         Cleaned Col
0      , , , Apples , , , Bananas , , ,        Apples, Bananas
1    , , , , , , , , Apples , , , , ,                   Apples
2        Apples , , Watermelon , , , , , ,  Apples, Watermelon
3  , , , , , , , , , , , , , , , , ,

Answer 2

You could try replacing the commas with spaces, then splitting on whitespace (which drops the leading and trailing spaces and collapses the runs in between) and joining the pieces with a comma. This treats every whitespace-separated word as its own field:

df['Cleaned Col'] = df['Text'].apply(lambda x: ', '.join(x.replace(',', ' ').split()))
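If fields may contain inner spaces (such as Hard Work in the question), a non-regex variant is to split on the commas themselves rather than on whitespace. The following is only a sketch under that assumption; the helper name clean_fields is hypothetical and the sample rows are copied from the question.

```python
import pandas as pd

# Sample rows taken from the question's Text column.
df = pd.DataFrame({
    "Text": [
        ", , , Apples, , , Hard Work, , , , , ",
        ", , , , , , , , Apples, , , , , ",
        "Apples, , Watermelon, , , , , ,",
        ", , , , , , , , , , , , , , , , , ,",
    ]
})


def clean_fields(text: str) -> str:
    """Hypothetical helper: split on commas, trim each piece, drop empty ones."""
    parts = (part.strip() for part in text.split(","))
    return ", ".join(part for part in parts if part)


df["Cleaned Col"] = df["Text"].apply(clean_fields)
print(df["Cleaned Col"].tolist())
# ['Apples, Hard Work', 'Apples', 'Apples, Watermelon', '']
```

Because the split happens on commas only, whitespace inside a field is preserved, and a row made up entirely of separators becomes an empty string.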

Answer 3

Since your fields are comma-delimited, you can use

# If the fields CANNOT contain whitespace:
df['Cleaned Col'] = df['Text'].str.findall(r'[^\s,]+').str.join(', ')

# If the fields can contain whitespace:
df['Cleaned Col'] = df['Text'].str.findall(r'[^\s,](?:[^,]*[^\s,])?').str.join(', ')

The regex extracts all found matches and .str.join(', ') joins the resulting list items into a single string. The regex (see its demo) means:

[^\s,]+ - one or more chars other than whitespace and comma
[^\s,] - a single char other than whitespace and comma
(?:[^,]*[^\s,])? - an optional occurrence of any zero or more chars other than a comma and then a char other than whitespace and comma.
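To see how the two patterns differ, here is a small illustration with the standard re module on one row from the question. Series.str.findall applies the same kind of matching to each element, so the plain-string version should behave the same way; the variable names are only for illustration.

```python
import re

text = ", , , Apples, , , Hard Work, , , , , "

# Fields without inner whitespace: every run of non-space, non-comma chars is a match.
no_space = re.findall(r"[^\s,]+", text)

# Fields that may contain inner whitespace: a match starts and ends on a
# non-space, non-comma char and may contain anything but a comma in between.
with_space = re.findall(r"[^\s,](?:[^,]*[^\s,])?", text)

print(no_space)               # ['Apples', 'Hard', 'Work']
print(with_space)             # ['Apples', 'Hard Work']
print(", ".join(with_space))  # Apples, Hard Work
```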

If you have your commas padded with spaces and you really want to use Series.str.replace, you could use

df['Cleaned Col'] = df['Text'].str.replace(r'^[\s,]+|[\s,]+$|(\s)*(,)[\s,]*', r'\2\1', regex=True)

See this regex demo.

Details:

^[\s,]+ - one or more whitespaces or commas at the start of string
[\s,]+$ - one or more whitespaces or commas at the end of string
(\s)*(,)[\s,]* - zero or more whitespaces (the last one matched is kept in Group 1, \1), then a comma (captured into Group 2, \2) and then zero or more whitespace or comma chars.

The replacement is the Group 2 value followed by the Group 1 value.
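The behaviour of this replacement can be checked on plain strings with re.sub, which is a reasonable stand-in for Series.str.replace with regex=True. This is a sketch only: the helper name collapse_commas is hypothetical, and the first input assumes space-padded commas as in the sample output of Answer 1.

```python
import re

PATTERN = r"^[\s,]+|[\s,]+$|(\s)*(,)[\s,]*"


def collapse_commas(text: str) -> str:
    """Hypothetical helper: strip leading/trailing separators, collapse inner runs."""
    # Groups that did not take part in a match are replaced with an empty string.
    return re.sub(PATTERN, r"\2\1", text)


# Commas padded with a space on the left: the space captured in Group 1
# is re-emitted after the comma.
print(collapse_commas(", , , Apples , , , Bananas , , , "))  # Apples, Bananas

# Commas without a preceding space: Group 1 is empty, so no space follows the comma.
print(collapse_commas("Apples, , Watermelon, , , , , ,"))    # Apples,Watermelon
```

The second call shows why the padding matters: when no whitespace precedes the comma, the fields are joined by a bare comma, so the findall-based approach may be the safer choice for data like the question's.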

3 answers

Answer 1
4 2021-10-01 07:04:28

Answer 2
4 2021-10-01 07:05:18

Answer 3
4 accepted 2021-10-01 07:08:19
