How do I remove line breaks if they occur inside an HTML tag?

Question

Sorry, another Python newbie question. I have a string:

my_string = "<p>this is some \n fun</p>And this is \n some more fun!"

I would like:

my_string = "<p>this is some fun</p>And this is \n some more fun!"

In other words, how do I get rid of '\n' only if it occurs inside an HTML tag?

I have:

my_string = re.sub('<(.*?)>(.*?)\n(.*?)</(.*?)>', 'replace with what???', my_string)

Which obviously won't work, but I'm stuck.

Answer 1

Regular expressions are a bad match for HTML. Don't do it. See "RegEx match open tags except XHTML self-contained tags".
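To illustrate the point (this sketch is not part of the original answer), here is a naive regex attempt. The helper name `strip_newlines_regex` and the nested sample string are made up for the example. It copes with the flat string from the question, but once tags are nested the lazy match stops at the first closing tag and a newline inside the outer element survives. It also ignores tags that carry attributes.

```python
import re

def strip_newlines_regex(html):
    """Naive attempt: drop newlines between <tag> and the nearest </tag>."""
    pattern = re.compile(r"<(\w+)>(.*?)</\1>", flags=re.DOTALL)
    # Replace each matched element with a copy that has its newlines removed.
    return pattern.sub(lambda m: m.group(0).replace("\n", ""), html)

flat = "<p>this is some \n fun</p>And this is \n some more fun!"
nested = "<div>a \n <div>b \n </div> c \n </div>"

flat_result = strip_newlines_regex(flat)
nested_result = strip_newlines_regex(nested)

# The flat case looks fine, but the nested case keeps the newline after "c",
# which is still inside the outer <div>.
print(repr(flat_result))
print(repr(nested_result))
```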

Instead, use an HTML parser. Python ships with html.parser, or you can use Beautiful Soup or html5lib. All you have to do then is walk the tree and remove line breaks.
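As a rough sketch of that idea using only the standard library's html.parser (the class name `NewlineStripper`, the helper `strip_newlines_in_tags`, and the `VOID_TAGS` set are hypothetical names chosen for this example, not anything the answer prescribes): the parser counts how many tags are currently open and rewrites text only while that count is above zero. Here a newline plus its surrounding whitespace is collapsed to a single space, which is one possible reading of the output the question asks for.

```python
import re
from html.parser import HTMLParser

# Elements that never get a closing tag; an assumption, and not a complete list.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

class NewlineStripper(HTMLParser):
    """Rebuilds the input, collapsing newlines in text that sits inside an open tag."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.depth = 0    # number of currently open (non-void) tags
        self.parts = []   # output fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> do not change the depth.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth > 0:
            # Inside a tag: turn a newline and the whitespace around it into one space.
            data = re.sub(r"\s*\n\s*", " ", data)
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

def strip_newlines_in_tags(html):
    parser = NewlineStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
result = strip_newlines_in_tags(my_string)
print(repr(result))
```

This sketch drops comments and doctype declarations and lowercases closing tag names, so for real pages Beautiful Soup or html5lib is likely the more robust choice.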

Answer 2

You should try using BeautifulSoup (bs4), which lets you parse HTML and XML documents.

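>>> import bs4
>>> my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
>>> soup = bs4.BeautifulSoup(my_string, "html.parser")    # name the parser explicitly
>>> # str.replace returns a new string, so assign it back into the tree
>>> soup.p.string = soup.p.string.replace('\n ', '')
>>> print(soup)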

This removes the newline inside the p tag. If the content has more than one tag, you can pass None to find_all to match every tag, then use a for loop over each tag's children (the tag.children property) and clean the text nodes.

For example:

>>> tags = soup.find_all(None)                       # None matches every tag
>>> for tag in tags:
...     for child in list(tag.children):             # copy, since the tree is modified in the loop
...         if isinstance(child, bs4.NavigableString):
...             # replace the text node with a cleaned copy
...             child.replace_with(child.replace('\n ', ''))

Though this might not work exactly the way you want (web pages vary), you can adapt this code to your needs.


Answer 1: 5, 2013-01-27 17:58:49

Answer 2: 2, accepted, 2013-01-27 18:18:33
