How do I remove linebreaks ONLY if they occur inside html tags?
Sorry, another Python newbie question. I have a string:
my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
I would like:
my_string = "<p>this is some fun</p>And this is \n some more fun!"
In other words, how do I get rid of '\n' only if it occurs inside an HTML tag?
I have:
my_string = re.sub('<(.*?)>(.*?)\n(.*?)</(.*?)>', 'replace with what???', my_string)
Which obviously won't work, but I'm stuck.
Regular expressions are a bad match for HTML. Don't do it. See "RegEx match open tags except XHTML self-contained tags".
Instead, use an HTML parser. Python ships with html.parser, or you can use Beautiful Soup or html5lib. All you have to do then is walk the tree and remove line breaks.
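As a rough illustration of the parser-based approach, here is a minimal sketch using only the standard library's html.parser. The names NewlineStripper, VOID_TAGS and strip_newlines_in_tags are made up for this example, and the whitespace rule (collapse each line break plus the whitespace around it into one space) is an assumption about what "remove" should mean. It tracks how many elements are open and only touches text seen while at least one is open. It does not special-case script, style or pre, and it assumes reasonably well-formed markup.

```python
import re
from html.parser import HTMLParser

# Assumption: elements that never get a closing tag, so they must not
# count as "open" when deciding whether text is inside an element.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NewlineStripper(HTMLParser):
    """Rebuild the markup, dropping line breaks from text inside elements."""

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.depth = 0    # number of currently open elements
        self.parts = []   # pieces of the rebuilt output

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form like <br/>: emit as-is, depth is unchanged.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(self.depth - 1, 0)
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth > 0:
            # Inside an element: replace each line break (and the
            # whitespace around it) with a single space.
            data = re.sub(r"\s*\n\s*", " ", data)
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")


def strip_newlines_in_tags(html):
    parser = NewlineStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
print(repr(strip_newlines_in_tags(my_string)))
```

Text outside any element (the "And this is \n some more fun!" part) is passed through unchanged, which is the behaviour asked for in the question.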
You should try using BeautifulSoup (bs4); it lets you parse HTML/XML tags and pages.
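>>> import bs4
>>> my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
>>> # name the parser explicitly so bs4 does not have to guess one
>>> soup = bs4.BeautifulSoup(my_string, 'html.parser')
>>> p = soup.p.contents[0].replace('\n ','')
>>> print(p)
this is some fun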
This will pull out the newline in the p tag. If the content has more than one tag, find_all(None) (which matches every tag) can be used together with a for loop, then going through each tag's children (using the tag.children property).
For example:
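>>> tags = soup.find_all(None)  # None matches every tag
>>> for tag in tags:
...     for child in list(tag.children):
...         # str.replace returns a new string, so write it back into the tree
...         if isinstance(child, bs4.NavigableString):
...             child.replace_with(child.replace('\n ', ''))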
This might not work exactly the way you want (web pages vary), but the code can be adapted to your needs.