Remove numbering, newlines, and break tags from a pandas DataFrame column

Question

I have a dataframe with a column containing strings with newlines, break tags and list numbering:

df['Side_Effects'][0]
'1.Nausea\n<br/>2.Vomiting\n<br/>3.Diarrhoea\n<br/>4.Anorexia\n<br/>5.Malaise\n<br/>6.Fever\n<br/>7.Pruritis\n<br/>8.Rash\n<br/>9.Headache\n<br/>10.Pharyngitis\n<br/>11.Cough\n<br/>'

First I have to remove the numbering, newlines and br tags from all the strings in the column. I tried:

df['Side_Effects'].replace(r'\\n',' ', regex=True, inplace=True)

and this:

df['Side_Effects'] = df['Side_Effects'].str.replace('</br>','')

but nothing seems to work. I would appreciate any help!

Answer 1

Using regex and str methods

Ex:

import pandas as pd

df = pd.DataFrame({'Col': ['1.Nausea\n<br/>2.Vomiting\n<br/>3.Diarrhoea\n<br/>4.Anorexia\n<br/>5.Malaise\n<br/>6.Fever\n<br/>7.Pruritis\n<br/>8.Rash\n<br/>9.Headache\n<br/>10.Pharyngitis\n<br/>11.Cough\n<br/>']})
# regex=True is passed explicitly; recent pandas versions treat the pattern as a literal string otherwise.
# If you need a list instead of a string, skip .str.join(" ")
df['New'] = df['Col'].str.replace(r'<br/>|\d+\.', '', regex=True).str.split().str.join(" ")
print(df)

Output:

             Col                                 New                                            
0  1.Nausea\n<br/>2.Vomiting\n<br/>3.Diarrhoea\n<...  Nausea Vomiting Diarrhoea Anorexia Malaise Fev...
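The comment in the snippet above mentions that the result can be kept as a list or joined into a string. A minimal sketch of both variants follows, assuming a shortened sample string; the column names As_List and As_Text are illustrative and not from the original post.

```python
import pandas as pd

# Shortened sample in the same shape as the data in the question.
df = pd.DataFrame({"Col": ["1.Nausea\n<br/>2.Vomiting\n<br/>3.Diarrhoea\n<br/>"]})

# Remove <br/> tags and "N." numbering. regex=True is explicit so the
# pattern is not interpreted as a literal string.
cleaned = df["Col"].str.replace(r"<br/>|\d+\.", "", regex=True)

# Splitting on whitespace also discards the newlines.
df["As_List"] = cleaned.str.split()         # hypothetical column name
df["As_Text"] = df["As_List"].str.join(" ")  # hypothetical column name

print(df["As_List"][0])
print(df["As_Text"][0])
```

Note that splitting on whitespace would also break up any entry that contains a space (for example a two-word side effect), so the list form is only reliable when each item is a single word.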

Answer 2

You may use

df['Side_Effects'] = df['Side_Effects'].str.replace(r'(?m)^(?:<br/>)?\d+\.|<br/>', '', regex=True).str.strip()

See regex demo

Details

(?m)^ - start of a line ((?m) is an inline variant of the re.M / re.MULTILINE flag)
(?:<br/>)? - an optional <br/> string
\d+\. - 1 or more digits and then a .
| - or
<br/> - just a <br/> string.

The .str.strip() removes any leading and trailing whitespace.
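To see what each part of the pattern matches, it can be tried on a single string with Python's re module, which uses the same regex syntax as pandas' .str.replace with regex=True. This is only a sketch on a shortened sample; the variable names are illustrative.

```python
import re

# Shortened sample in the same shape as the data in the question.
sample = "1.Nausea\n<br/>2.Vomiting\n<br/>10.Pharyngitis\n<br/>"

# Same pattern as in the answer: line-start numbering with an optional
# leading <br/>, or any remaining <br/> tag.
pattern = re.compile(r"(?m)^(?:<br/>)?\d+\.|<br/>")

# Remove the matches, then trim leading and trailing whitespace.
stripped = pattern.sub("", sample).strip()

# The newlines between items are still present, so they can be used to split.
items = stripped.split("\n")
print(items)
```

The pattern itself leaves the newlines between items in place; if a single-line string is wanted, they would need an additional replacement step.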
