Regular expression with multiple lines and HTML tags

Question

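I'm writing a script that should open 10 text files in turn (they are the source code of different web pages). I then want the script to go through each one and replace any instance of <br /> with \n. After that, I want it to essentially delete the whole header. The document always starts with DOCTYPE, and the last line before the information I want is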

"decoration:underline">no year</span><br />

As far as I understand, the regex form /.../s means "ignore line breaks", and I have escaped the / that appears in the </span> tag. So far I have the following:

import re
def create_linebreaks(l):
    l = l.replace('<br />', r'\n')
    return l
def clean_up(line):
    line = re.sub(r'/^<!DOCTYPE.+no year<\/span>/s', '', line)
    return line

data = """<!DOCTYPE html><html class='v2' dir='ltr' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' movie/file/show/episodes is 2763.</p>A LOAD OF OTHER HTML I DON'T WANT TO BE IN THE OUTPUT
<!-- google_ad_section_start(weight=ignore) --><span class="listings"><span style="font-size:large;font-weight:bold; text-decoration:underline">no year</span><br />  <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.5 stars, 1hr 24m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " />  <br />  <br />  <b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.7 stars, 1hr 28m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " />  <br />  <br />"""

create_linebreaks(data)
clean_up(data)
print data
raw_input()

All I get back is the same string.

The desired output is something like:

"""  <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.5 stars, 1hr 24m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " />  

<b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.7 stars, 1hr 28m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " />  """

Answer 1

The main problem is that your regex pattern is wrong for Python.

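In r'/^<!DOCTYPE.+no year<\/span>/s', the leading / and the trailing /s are treated as part of the pattern, not as delimiters and a modifier that change its behavior. That looks like PHP's PCRE regex syntax, which Python does not support. Instead, for . to match any character including a newline, you need to set the re.DOTALL flag, as shown below.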
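To make the difference concrete, here is a minimal sketch on a short made-up string (`sample` is an assumption for illustration, not the data from the question). It compares the PHP-style pattern, the same pattern without any flag, and the pattern with re.DOTALL:

```python
import re

# Hypothetical three-line input standing in for the real page source.
sample = "<!DOCTYPE html>\nheader junk\nno year</span>KEEP"

# PHP-style delimiters: the leading "/" and trailing "/s" are matched literally,
# so the pattern never matches and the string comes back unchanged.
php_style = re.sub(r'/^<!DOCTYPE.+no year<\/span>/s', '', sample)

# Without DOTALL, "." stops at the first newline, so this does not match either.
no_flag = re.sub(r'^<!DOCTYPE.+no year</span>', '', sample)

# With DOTALL, "." also matches "\n", so the whole header is removed.
with_flag = re.sub(r'^<!DOCTYPE.+no year</span>', '', sample, flags=re.DOTALL)

print(php_style == sample)
print(no_flag == sample)
print(repr(with_flag))
```

Only the last call changes the string, because it is the only one whose pattern can span the line breaks.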

Another problem is that the return values of create_linebreaks() and clean_up() are not assigned back to data, so the changes are lost.
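The reason is that Python strings are immutable: str.replace() and re.sub() return a new string and leave the original alone. A small sketch, using a made-up two-word string rather than the real page:

```python
def create_linebreaks(text):
    # str.replace returns a new string; the argument itself is never modified.
    return text.replace('<br />', '\n')

data = "one<br />two"

create_linebreaks(data)          # result is discarded
unchanged = data                 # still contains "<br />"

data = create_linebreaks(data)   # result is kept by rebinding the name
print(repr(unchanged))
print(repr(data))
```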

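Also, you don't want a raw string for the newline in create_linebreaks(); use a normal string. Otherwise <br /> is replaced with a literal backslash followed by n (what you would write as '\\n' in a normal string) rather than with an actual line break.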
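A quick way to see the difference between the two literals, as a sketch (the variable names here are only for illustration):

```python
raw = r'\n'      # two characters: a backslash followed by "n"
normal = '\n'    # one character: a line feed

print(len(raw), len(normal))

text = "a<br />b"
with_raw = text.replace('<br />', raw)        # inserts backslash + "n"
with_normal = text.replace('<br />', normal)  # inserts a real line break
print(repr(with_raw))
print(repr(with_normal))
```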

import re

def create_linebreaks(l):
    l = l.replace('<br />', '\n')
    return l

def clean_up(line):
    line = re.sub(r'^<!DOCTYPE.+no year<\/span>', '', line, flags=re.DOTALL)
    return line

data = """<!DOCTYPE html><html class='v2' dir='ltr' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' movie/file/show/episodes is 2763.</p>A LOAD OF OTHER HTML I DON'T WANT TO BE IN THE OUTPUT
<!-- google_ad_section_start(weight=ignore) --><span class="listings"><span style="font-size:large;font-weight:bold; text-decoration:underline">no year</span><br />  <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.5 stars, 1hr 24m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " />  <br />  <br />  <b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.7 stars, 1hr 28m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " />  <br />  <br />"""

data = create_linebreaks(data)
data = clean_up(data)

>>> print data

  <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.5 stars, 1hr 24m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " />  

  <b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.7 stars, 1hr 28m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " />  


>>>
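The code above uses the Python 2 print statement. As a rough Python 3 sketch of the same idea, the two steps can be folded into one helper. The name `clean_page` and the miniature `page` string are assumptions made for illustration. This variant also uses a non-greedy `.+?`, which is a deliberate change from the answer's greedy `.+`: it stops at the first "no year</span>" instead of the last, which only matters if the marker can occur more than once in a page.

```python
import re

def clean_page(source):
    """Strip the header up to the first marker, then turn <br /> into newlines."""
    # Non-greedy ".+?" stops at the first "no year</span>"; a greedy ".+"
    # would run to the last one. Which is wanted depends on the pages.
    body = re.sub(r'^<!DOCTYPE.+?no year</span>', '', source,
                  count=1, flags=re.DOTALL)
    return body.replace('<br />', '\n')

# Made-up miniature page, not the real data from the question.
page = "<!DOCTYPE html>\n<head>x</head>\n<span>no year</span><br />  <b>Title</b>  <br />"
print(clean_page(page))
```

Applied to each of the 10 files, the helper would be called on the text read from the file and the result written or printed as needed.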

Solution 1
1 2014-09-17 14:02:44
