Removing repeated commas from a Pandas DataFrame column (in other words, I just need the text from the column, separated by commas)
I have this dataframe with the Text column:
Text | Cleaned Col |
---|---|
, , , Apples, , , Hard Work, , | Apples, Hard Work |
, , , , , , , , Apples, , , , , | Apples |
Apples, , Watermelon, , , , , , | Apples, Watermelon |
, , , , , , , , , , , , , , , , , | |
I would like to create a column such as Cleaned Col, essentially using regex. I looked at different patterns such as r'\s*,*([^(a-zA-Z)]*)' but I am not getting the right outcome.
Use Series.str.findall to get the words, then join them with a comma:
# Raw string so that \w reaches the regex engine unchanged
df['Cleaned Col'] = df['Text'].str.findall(r'\w+').str.join(', ')
print(df)
Text Cleaned Col
0 , , , Apples , , , Bananas , , , Apples, Bananas
1 , , , , , , , , Apples , , , , , Apples
2 Apples , , Watermelon , , , , , , Apples, Watermelon
3 , , , , , , , , , , , , , , , , ,
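One caveat with \w+ is that it only matches letters, digits, and underscores, so a field containing punctuation is split into pieces. The sketch below reproduces the findall-then-join step on plain strings with the standard re module; the helper name clean_words and the sample values Ice-cream and R&D are hypothetical and do not come from the question's data.

```python
import re

def clean_words(text: str) -> str:
    """Per-string equivalent of .str.findall(r'\w+').str.join(', ')."""
    return ', '.join(re.findall(r'\w+', text))

# Plain single-word fields behave as in the output above.
print(clean_words(', , Apples , , Bananas , ,'))   # Apples, Bananas

# Hypothetical fields with punctuation are broken apart at the punctuation.
print(clean_words(', , Ice-cream, , R&D, '))       # Ice, cream, R, D
```

If the fields may contain such characters, a pattern that excludes only the delimiter is likely the safer choice.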
You could try replacing the commas with spaces, then splitting on whitespace (which discards the leading and trailing spaces and collapses each run of spaces) and joining the pieces with a comma. Note that this also splits multi-word fields such as Hard Work:
df['Cleaned Col'] = df['Text'].apply(lambda x: ', '.join(x.replace(',', ' ').split()))
Since your fields are comma-delimited, you can use
# If the fields CANNOT contain whitespace:
df['Cleaned Col'] = df['Text'].str.findall(r'[^\s,]+').str.join(', ')
# If the fields can contain whitespace:
df['Cleaned Col'] = df['Text'].str.findall(r'[^\s,](?:[^,]*[^\s,])?').str.join(', ')
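To see how the two patterns differ, here is a minimal sketch that applies each one to a row from the question using the standard re module. The helper clean and the constant names are hypothetical; the function only mimics, for one string, what .str.findall(...).str.join(', ') does for each element of the column.

```python
import re

# Patterns copied from the two lines above.
NO_WHITESPACE = r'[^\s,]+'
ALLOW_WHITESPACE = r'[^\s,](?:[^,]*[^\s,])?'

def clean(text: str, pattern: str) -> str:
    """Find all matches of pattern in text and join them with ', '."""
    return ', '.join(re.findall(pattern, text))

sample = ', , , Apples, , , Hard Work, , '
print(clean(sample, NO_WHITESPACE))     # Apples, Hard, Work
print(clean(sample, ALLOW_WHITESPACE))  # Apples, Hard Work
```

The second pattern keeps Hard Work intact because it may run across inner spaces but must start and end on a non-whitespace, non-comma character.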
Series.str.findall extracts all matches of the regex, and .str.join(', ') joins the resulting list items into a single string. The regex parts mean:
[^\s,]+ - one or more chars other than whitespace and comma
[^\s,] - a single char other than whitespace and comma
(?:[^,]*[^\s,])? - an optional occurrence of any zero or more chars other than a comma and then a char other than whitespace and comma.
If your commas are padded with spaces and you really want to use Series.str.replace, you could use
df['Cleaned Col'] = df['Text'].str.replace(r'^[\s,]+|[\s,]+$|(\s)*(,)[\s,]*', r'\2\1', regex=True)
Details:
^[\s,]+ - one or more whitespaces or commas at the start of the string
[\s,]+$ - one or more whitespaces or commas at the end of the string
(\s)*(,)[\s,]* - zero or more whitespaces (the last one matched is kept in Group 1, \1), then a comma (captured into Group 2, \2), and then zero or more whitespace or comma chars.
The replacement is Group 2 + Group 1 values.
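The replacement only yields a ", " separator when a whitespace character precedes the comma, since that whitespace is what Group 1 captures. The sketch below illustrates this on single strings with re.sub, assuming Python 3.5 or later, where an unmatched group in the replacement expands to an empty string. The helper name clean is hypothetical.

```python
import re

# Pattern and replacement copied from the Series.str.replace call above.
PATTERN = re.compile(r'^[\s,]+|[\s,]+$|(\s)*(,)[\s,]*')

def clean(text: str) -> str:
    """Per-string equivalent of .str.replace(PATTERN, r'\2\1', regex=True)."""
    return PATTERN.sub(r'\2\1', text)

# Commas padded with a leading space: Group 1 holds that space.
print(repr(clean('Apples , , Watermelon , , , , , ,')))   # 'Apples, Watermelon'

# No space before the comma: Group 1 is empty, so no space is emitted.
print(repr(clean(', , , Apples, , , Hard Work, , ')))     # 'Apples,Hard Work'
```

For data where the commas are not padded, the findall-based approach is probably the more predictable option.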