
How do I remove linebreaks ONLY if they occur inside html tags?

Sorry, another python newbie question. I have a string:

my_string = "<p>this is some \n fun</p>And this is \n some more fun!"

I would like:

my_string = "<p>this is some fun</p>And this is \n some more fun!"

In other words, how do I get rid of '\n' only if it occurs inside an html tag?

I have:

my_string = re.sub('<(.*?)>(.*?)\n(.*?)</(.*?)>', 'replace with what???', my_string)

Which obviously won't work, but I'm stuck.

Regular expressions are a bad match for HTML. Don't do it. See "RegEx match open tags except XHTML self-contained tags".

Instead, use an HTML parser. Python ships with html.parser, or you can use Beautiful Soup or html5lib. All you have to do then is walk the tree and remove line breaks.
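As a rough illustration of the standard-library route, here is a minimal sketch built on html.parser. The names NewlineStripper and strip_newlines_in_tags are made up for this example, and it assumes reasonably well-formed markup: void elements written without a slash (such as a bare <br>) would throw off the depth counter, and comments and doctypes are dropped.

```python
from html.parser import HTMLParser


class NewlineStripper(HTMLParser):
    """Rebuild the markup, dropping newlines from text inside an open tag.

    Hypothetical helper for illustration; not part of the standard library.
    """

    def __init__(self):
        # Keep entities such as &amp; untouched so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.depth = 0    # number of tags currently open
        self.parts = []   # output fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag such as <br/>: emit it, nesting depth is unchanged.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth > 0:
            # Inside a tag: remove a newline plus the space after it,
            # then any remaining bare newlines.
            data = data.replace("\n ", "").replace("\n", "")
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def strip_newlines_in_tags(markup):
    """Return markup with newlines removed only from text inside tags."""
    parser = NewlineStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
result = strip_newlines_in_tags(my_string)
print(repr(result))
```

Text that sits outside every tag (the part after </p> here) passes through unchanged, which is the behaviour the question asks for.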

You should try using BeautifulSoup (bs4); this will allow you to parse XML tags and pages.

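>>> import bs4
>>> my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
>>> # Name the parser explicitly so the result does not depend on what is installed
>>> soup = bs4.BeautifulSoup(my_string, 'html.parser')
>>> p = soup.p.contents[0].replace('\n ', '')
>>> print(p)
this is some fun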

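This will pull out the new line in the p tag. Note that replace returns a new string and leaves soup itself unchanged. If the content has more than one tag, you can pass True to find_all to match every tag, loop over the results, and rewrite each tag's direct text children in place with replace_with. (This assumes soup was built with the built-in html.parser, which leaves text outside any tag at the top level; a parser such as lxml wraps it in extra tags, so it would be rewritten too.)

For example:

>>> for tag in soup.find_all(True):
...     # Only the strings directly inside this tag; nested tags get their own turn
...     for text in tag.find_all(string=True, recursive=False):
...         text.replace_with(text.replace('\n ', ''))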

Though this might not work exactly the way you want (as web pages can vary), the code can be adapted to your needs.
