Regex with multiple lines and HTML tags
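I'm writing a script that should open 10 text files in turn (they are the source code of different web pages). I then want the script to go through and replace any instance of `<br />` with `\n`. After that, I want it to essentially delete the whole header. The document always starts with `DOCTYPE`, and the last line before the information I want is `"decoration:underline">no year</span><br />`. As far as I can tell, the regex `/.../s` means "ignore line breaks", and I have escaped the `/` that appears in the HTML `</span>` tag. So far I have the following: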
import re

def create_linebreaks(l):
    l = l.replace('<br />', r'\n')
    return l

def clean_up(line):
    line = re.sub(r'/^<!DOCTYPE.+no year<\/span>/s', '', line)
    return line

data = """<!DOCTYPE html><html class='v2' dir='ltr' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' movie/file/show/episodes is 2763.</p>A LOAD OF OTHER HTML I DON'T WANT TO BE IN THE OUTPUT
<!-- google_ad_section_start(weight=ignore) --><span class="listings"><span style="font-size:large;font-weight:bold; text-decoration:underline">no year</span><br /> <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b> <i style="font-size:small"> 3.5 stars, 1hr 24m <a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokm��l - English " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokm��l - English " /> <br /> <br /> <b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b> <i style="font-size:small"> 3.7 stars, 1hr 28m <a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokm��l " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokm��l " /> <br /> <br />"""
create_linebreaks(data)
clean_up(data)
print data
raw_input()
All I get back is the same string.
The desired output is something like:
""" <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b> <i style="font-size:small"> 3.5 stars, 1hr 24m <a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokm��l - English " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokm��l - English " />
<b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b> <i style="font-size:small"> 3.7 stars, 1hr 28m <a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokm��l " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokm��l " /> """
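The main problem is that your regex pattern is wrong for Python. In `r'/^<!DOCTYPE.+no year<\/span>/s'`, the leading `/` and the trailing `/s` are treated as part of the pattern, not as modifiers of its behaviour. That looks like the PCRE regex syntax used in PHP, which Python does not support. Instead, for `.` to match any character including newlines, you need to set the `re.DOTALL` flag, as shown below.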
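To make the difference concrete, here is a minimal sketch on a made-up string (`sample` is an assumption for illustration, not the asker's data). It compares the PHP-style pattern, the same pattern without delimiters, and the pattern with `re.DOTALL`.

```python
import re

# Hypothetical three-line sample standing in for the real page source.
sample = "<!DOCTYPE html>\nheader\nno year</span>body"

# PHP-style delimiters: Python treats '/' and '/s' as literal characters,
# so the pattern needs a '/' before the start of the string and never matches.
php_style = re.sub(r'/^<!DOCTYPE.+no year<\/span>/s', '', sample)

# No flag: '.' does not match '\n', so the match cannot cross the first line.
no_flag = re.sub(r'^<!DOCTYPE.+no year</span>', '', sample)

# re.DOTALL: '.' also matches '\n', so the whole header is removed.
dotall = re.sub(r'^<!DOCTYPE.+no year</span>', '', sample, flags=re.DOTALL)

print(php_style == sample)  # True
print(no_flag == sample)    # True
print(dotall)               # body
```

Only the third call changes the string, which is consistent with the unchanged output reported in the question.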
Another problem is that the return values of `create_linebreaks()` and `clean_up()` are not assigned back to `data`, so the changes are lost.
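The reason is that Python strings are immutable: `str.replace` and `re.sub` return a new string and leave their argument untouched. A small sketch, assuming a made-up `snippet` value and a one-line variant of `create_linebreaks()`:

```python
def create_linebreaks(text):
    # str.replace returns a new string; the argument itself is not modified.
    return text.replace('<br />', '\n')

snippet = 'a<br />b'  # hypothetical sample

create_linebreaks(snippet)            # result is discarded
unchanged = snippet                   # still the original text

snippet = create_linebreaks(snippet)  # result is kept

print(repr(unchanged))  # 'a<br />b'
print(repr(snippet))    # 'a\nb'
```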
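Also, you do not want a raw string for the newline in `create_linebreaks()`; use a normal string instead (otherwise you are replacing `<br />` with `\\n`, a literal backslash followed by `n`).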
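A quick way to check the raw-string point is to compare the two literals directly. The `html` value here is a made-up example, not taken from the question:

```python
raw = r'\n'    # two characters: a backslash followed by 'n'
plain = '\n'   # one character: a newline

print(len(raw), len(plain))  # 2 1
print(raw == '\\n')          # True

html = 'first<br />second'   # hypothetical sample
with_raw = html.replace('<br />', raw)
with_plain = html.replace('<br />', plain)

print(with_raw)    # first\nsecond  (one line, with a visible backslash)
print(with_plain)  # prints "first" and "second" on separate lines
```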
import re

def create_linebreaks(l):
    l = l.replace('<br />', '\n')
    return l

def clean_up(line):
    line = re.sub(r'^<!DOCTYPE.+no year<\/span>', '', line, flags=re.DOTALL)
    return line

data = """<!DOCTYPE html><html class='v2' dir='ltr' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' movie/file/show/episodes is 2763.</p>A LOAD OF OTHER HTML I DON'T WANT TO BE IN THE OUTPUT
<!-- google_ad_section_start(weight=ignore) --><span class="listings"><span style="font-size:large;font-weight:bold; text-decoration:underline">no year</span><br /> <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b> <i style="font-size:small"> 3.5 stars, 1hr 24m <a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokm��l - English " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokm��l - English " /> <br /> <br /> <b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b> <i style="font-size:small"> 3.7 stars, 1hr 28m <a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokm��l " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokm��l " /> <br /> <br />"""
data = create_linebreaks(data)
data = clean_up(data)
>>> print data
<b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b> <i style="font-size:small"> 3.5 stars, 1hr 24m <a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokm��l - English " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokm��l - English " />
<b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b> <i style="font-size:small"> 3.7 stars, 1hr 28m <a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokm��l " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokm��l " />
>>>
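One caveat that goes slightly beyond the answer above: `.+` is greedy, so with `re.DOTALL` it runs up to the last occurrence of `no year</span>` in the text. The sample data contains that marker only once, so the result is the same there. If a page could contain it more than once, a non-greedy `.+?` would stop at the first occurrence instead. A sketch on a made-up `page` string:

```python
import re

# Hypothetical page in which the marker appears twice.
page = "<!DOCTYPE html>head no year</span>first no year</span>second"

# Greedy: removes everything up to the LAST marker.
greedy = re.sub(r'^<!DOCTYPE.+no year</span>', '', page, flags=re.DOTALL)

# Non-greedy: removes everything up to the FIRST marker.
lazy = re.sub(r'^<!DOCTYPE.+?no year</span>', '', page, flags=re.DOTALL)

print(greedy)  # second
print(lazy)    # first no year</span>second
```

Which of the two is appropriate depends on the actual pages, which are not shown in full here.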