How to remove unnecessary tags?

Question

I have a field "body" in my table (MySQL), and it contains a lot of entries like:

</p><p>  &nbsp;</p><p>

</p><p> 
   </p><p>

A lot of spaces, newlines, &nbsp;, etc. How can I remove them?

This does not work:

text.replace('</p><p>&nbsp;</p><p>', '</p><p>')
text.replace('</p><p>\n</p><p>', '</p><p>')

Answer 1

text = ''.join(text.split()) - after that you can go on with the replace calls.
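A minimal sketch of how that could look end to end. The function name and the sample string are made up for illustration, and note the catch: split/join removes every whitespace character, including the spaces between words in real content, so this is probably only acceptable if you don't care about that.

```python
def strip_empty_paragraphs(text):
    """Drop all whitespace, then remove paragraphs that are now empty."""
    # split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines), so joining the pieces removes all of it.
    compact = "".join(text.split())
    # With the whitespace gone, the empty paragraphs have a fixed spelling
    # and plain str.replace can find them.
    for empty in ("<p>&nbsp;</p>", "<p></p>"):
        compact = compact.replace(empty, "")
    return compact


# Hypothetical sample value, not taken from the real table.
sample = "<p>Hello world</p><p>  &nbsp;</p><p>\n\n</p><p> \n   </p><p>Bye</p>"
cleaned = strip_empty_paragraphs(sample)
print(cleaned)  # the space inside "Hello world" is lost as well
```

Also remember that str.replace returns a new string, so the result has to be assigned back; calling text.replace(...) on its own line changes nothing.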

Answer 2

I would parse such a file into a syntax tree, remove the empty leaves there, and then generate the HTML file again. Unfortunately I don't work in Python, so I can't name the helpful libraries for this.

Answer 3

What @Jurlie suggested is a good approach. Consider using BeautifulSoup for this purpose. It is a very mature and robust library.
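A rough sketch of the parse, prune and re-serialize idea with BeautifulSoup. It assumes the bs4 package is installed; the function name and the sample string are invented for the example, and treating img/iframe as "real content" is my own assumption about what you want to keep.

```python
from bs4 import BeautifulSoup


def remove_empty_paragraphs(html):
    """Parse the fragment, drop <p> tags with no visible content, re-serialize."""
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        # &nbsp; is decoded to U+00A0, which str.strip() treats as whitespace.
        has_text = p.get_text().strip() != ""
        # Assumption: a paragraph holding only an image or iframe is not empty.
        has_media = p.find(["img", "iframe"]) is not None
        if not has_text and not has_media:
            p.decompose()  # remove the tag from the tree
    return str(soup)


# Hypothetical sample value, not taken from the real table.
sample = "<p>Hello</p><p>  &nbsp;</p><p>\n\n</p><p>World</p>"
cleaned = remove_empty_paragraphs(sample)
print(cleaned)
```

Unlike the string-based approaches, this leaves the whitespace inside non-empty paragraphs alone.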

Answer 4

Try this regexp:

>>> import re
>>> text = '''</p><p>  &nbsp;</p><p>
... 
... </p><p> 
...    </p><p>
... '''
>>> re.sub(r'<p>(?:&nbsp;|\s|<br \/>)*?</p>\s*', '', text)
'</p><p>\n'
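For reuse, the same pattern could be compiled once and wrapped in a small helper. This is only a sketch: the helper name and the sample string are hypothetical, and the regexp is the one from the session above, unchanged.

```python
import re

# Matches a <p> whose body is only &nbsp;, whitespace or "<br />",
# plus any whitespace that follows the closing tag.
EMPTY_P = re.compile(r'<p>(?:&nbsp;|\s|<br \/>)*?</p>\s*')


def drop_empty_paragraphs(html):
    """Return html with empty paragraphs removed."""
    return EMPTY_P.sub('', html)


# Hypothetical sample value, not taken from the real table.
sample = '<p>Hello world</p>\n<p>  &nbsp;</p>\n<p><br /></p>\n<p>Bye</p>\n'
cleaned = drop_empty_paragraphs(sample)
print(repr(cleaned))
```

Paragraphs with real text are left alone because the group only accepts &nbsp;, whitespace and "<br />" between the tags.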

Answer 5

text.strip('>&nbsp;').strip(' ').strip('\n').strip('\t')
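One thing to keep in mind with this approach: str.strip takes a set of characters, not a substring, and it only trims the two ends of the string. A small sketch (the sample strings are made up) of what that means here:

```python
# The argument is a set of characters: any of > & n b s p ; is trimmed
# from both ends until a character outside the set is reached.
trimmed_word = 'spam>'.strip('>&nbsp;')
print(trimmed_word)

# Hypothetical sample value: the empty paragraph sits in the middle,
# so none of the strip calls can reach it.
text = '<p>a</p><p>&nbsp;</p><p>b</p>\n'
result = text.strip('>&nbsp;').strip(' ').strip('\n').strip('\t')
print(repr(result))
```

So the chain will tidy up the start and end of the value, but empty paragraphs inside the body stay where they are.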

Answer 1
2 2012-03-14 08:24:09

Answer 2
1 2012-03-14 08:24:52

Answer 3
1 2012-03-14 09:11:04

Answer 4
0 2012-03-14 08:33:12

Answer 5
0 2012-03-14 08:43:48
