Remove numbering, newline, break tags from pandas dataframe columns
I have a dataframe with columns containing strings with newlines, break tags and list numbering:
df['Side_Effects'][0]
'1.Nausea\n<br/>2.Vomiting\n<br/>3.Diarrhoea\n<br/>4.Anorexia\n<br/>5.Malaise\n<br/>6.Fever\n<br/>7.Pruritis\n<br/>8.Rash\n<br/>9.Headache\n<br/>10.Pharyngitis\n<br/>11.Cough\n<br/>'
First I have to remove the numbering, newlines and br tags from all the strings in the column. I tried:
df['Side_Effects'].replace(r'\\n',' ', regex=True, inplace=True)
and this:
df['Side_Effects'] = df['Side_Effects'].str.replace('</br>','')
but nothing seems to work... Would appreciate any help!
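A likely reason both attempts appear to do nothing, sketched on a shortened sample string (the variable names here are illustrative, not from the original post): as a regex, r'\\n' matches a literal backslash followed by "n" rather than a newline character, and '</br>' is not the tag that actually occurs in the data, which is '<br/>'.

```python
import pandas as pd

# Shortened sample in the same shape as the column values in the question.
s = pd.Series(['1.Nausea\n<br/>2.Vomiting\n<br/>'])

# r'\\n' as a regex means "backslash, then n"; the data holds real newlines,
# so nothing is replaced.
attempt_1 = s.replace(r'\\n', ' ', regex=True)

# The data contains '<br/>', but the search string is '</br>': no match.
attempt_2 = s.str.replace('</br>', '', regex=False)

# Targeting the real newline character and the real tag does change the value.
fixed = s.str.replace('\n', ' ', regex=False).str.replace('<br/>', '', regex=False)

print(attempt_1.equals(s))   # True: unchanged
print(attempt_2.equals(s))   # True: unchanged
print(repr(fixed[0]))        # numbering is still there, tags and newlines are gone
```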
Using Regex and str methods
Ex:
import pandas as pd

df = pd.DataFrame({'Col': ['1.Nausea\n<br/>2.Vomiting\n<br/>3.Diarrhoea\n<br/>4.Anorexia\n<br/>5.Malaise\n<br/>6.Fever\n<br/>7.Pruritis\n<br/>8.Rash\n<br/>9.Headache\n<br/>10.Pharyngitis\n<br/>11.Cough\n<br/>']})
# Strip <br/> tags and numbering, then split on whitespace and re-join with spaces (skip .str.join(" ") if you need a list)
df['New'] = df['Col'].str.replace(r'(<br/>|\d+\.)', '', regex=True).str.split().str.join(" ")
print(df)
Output:
Col New
0 1.Nausea\n<br/>2.Vomiting\n<br/>3.Diarrhoea\n<... Nausea Vomiting Diarrhoea Anorexia Malaise Fev...
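The comment in the snippet above mentions keeping the result as a list. A minimal sketch of both variants side by side, on a shortened sample (the column names As_List and As_Text are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Col': ['1.Nausea\n<br/>2.Vomiting\n<br/>10.Pharyngitis\n<br/>']})

# Remove tags and "N." prefixes; splitting on whitespace also discards the '\n'.
items = df['Col'].str.replace(r'<br/>|\d+\.', '', regex=True).str.split()

df['As_List'] = items                 # one Python list per row
df['As_Text'] = items.str.join(' ')   # the same items joined by single spaces

print(df['As_List'][0])
print(df['As_Text'][0])
```

Note that splitting on whitespace would also break apart any entry that itself contains a space, so this form suits single-word items like the ones in the question.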
You may use
df['Side_Effects'] = df['Side_Effects'].str.replace(r'(?m)^(?:<br/>)?\d+\.|<br/>', '', regex=True).str.strip()
See regex demo
Details
(?m)^ - start of a line ((?m) is an inline variant of the re.M / re.MULTILINE flag)
(?:<br/>)? - an optional <br/> string
\d+\. - 1 or more digits and then a .
| - or
<br/> - just the <br/> string.
The .str.strip() removes any leading and trailing whitespace.
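To see what this pattern does and does not remove, here is a small sketch using the standard re module on a shortened sample. One thing it makes visible: the newlines between items survive, since the pattern only targets the numbering and the tags. The final whitespace-collapsing step is an optional addition for illustration, not part of the answer above.

```python
import re

text = '1.Nausea\n<br/>2.Vomiting\n<br/>11.Cough\n<br/>'

# Same pattern as above: line-start numbering (optionally preceded by <br/>),
# or any remaining <br/>.
pattern = re.compile(r'(?m)^(?:<br/>)?\d+\.|<br/>')

cleaned = pattern.sub('', text).strip()
print(repr(cleaned))   # items are still separated by '\n'

# Optional extra step (an assumption about the desired output):
# collapse every whitespace run, including newlines, into one space.
one_line = ' '.join(cleaned.split())
print(one_line)
```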