Python regular expression to strip script tags

I'm a little scared to ask this for fear of retribution from the SO "You can't parse HTML with regular expressions" cult. Why does re.subn(r'<(script).*?</\1>', '', data, re.DOTALL) not strip the multiline 'script' but only the two single-line ones at the end, please?

Thanks, HC

>>> import re
>>> data = """\
<nothtml> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 
    <script type="text/javascript"> 
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.1',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script> 
    <script type="text/javascript" src="../_static/jquery.js"></script> 
    <script type="text/javascript" src="../_static/doctools.js"></script>
"""

>>> print (re.subn(r'<(script).*?</\1>', '', data, re.DOTALL)[0])
<nothtml> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 
    <script type="text/javascript"> 
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.1',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script> 

Leaving aside the question of whether this is a good idea in general, the problem with your example is that the fourth parameter to re.subn is count. There's no flags parameter in Python 2.6, although it was introduced as a fifth parameter in Python 2.7. Your re.DOTALL (an integer flag with the value 16) is therefore taken as the count, and the pattern runs without DOTALL, so . cannot cross the newlines inside the multi-line script. Instead you can add (?s) to your regular expression for the same effect. Put it at the start of the pattern: Python 2 also accepts it at the end, but Python 3.11 and later raise an error unless global inline flags come first:

>>> print (re.subn(r'(?s)<(script).*?</\1>', '', data)[0])

<nothtml> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 




>>>

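... or if you're using Python 2.7 or later, this should work:

>>> print (re.subn(r'<(script).*?</\1>', '', data, 0, re.DOTALL)[0])

... i.e. inserting 0 as the count parameter, so that re.DOTALL lands in the flags position. Passing flags=re.DOTALL by keyword does the same thing and avoids the mix-up altogether.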
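To make the count/flags mix-up concrete, here is a minimal sketch for Python 3. The sample data is a cut-down stand-in for the page in the question, and the variable names are only for illustration. It passes 16 as count on purpose to reproduce what the original call did.

```python
import re

# Cut-down stand-in for the document in the question (illustrative only).
data = """\
<head>
  <script type="text/javascript">
    var x = 1;
  </script>
  <script src="a.js"></script>
  <script src="b.js"></script>
</head>
"""

pattern = r'<(script).*?</\1>'

# re.DOTALL is an integer-valued flag (16). Passing it in the fourth position
# sets count=16 and leaves flags at 0, which is what this call reproduces.
wrong, n_wrong = re.subn(pattern, '', data, count=int(re.DOTALL))

# Passing the flag by keyword puts it where it belongs.
right, n_right = re.subn(pattern, '', data, flags=re.DOTALL)

# The inline flag, placed at the start of the pattern, has the same effect.
inline, n_inline = re.subn(r'(?s)' + pattern, '', data)

print(n_wrong, n_right, n_inline)
```

In the first call only the two single-line scripts are removed, because without DOTALL the dot stops at each newline. The other two calls remove all three.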

Just in case it's of interest, I thought I'd add an additional answer showing two ways of doing this with lxml, which I've found very nice for parsing HTML. (lxml is one of the alternatives that the author of BeautifulSoup suggests, in light of the problems with the most recent version of the latter library.)

The point of adding the first example is that it's really very simple and should be much more robust than using a regular expression to remove the tags. In addition, if you want to do any more complex processing of the document, or if the HTML you're parsing is malformed, you have a valid document tree that you can manipulate programmatically.

Remove all script tags

This example is based on the HTMLParser example from lxml's documentation:

from lxml import etree
from StringIO import StringIO

broken_html = '''
<html> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 
    <script type="text/javascript"> 
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.1',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script> 
    <script type="text/javascript" src="../_static/jquery.js"></script>
'''

parser = etree.HTMLParser()
tree = etree.parse(StringIO(broken_html), parser)

for s in tree.xpath('//script'):
    s.getparent().remove(s)

print etree.tostring(tree.getroot(), pretty_print=True)

That produces this output:

<html>
  <head>
    <title>Regular Expression HOWTO &#8212; Python v2.7.1 documentation</title>
  </head>
</html>

Use lxml's Cleaner module

On the other hand, since it looks as if you're trying to remove awkward tags like <script>, perhaps the Cleaner module from lxml will also do other things you'd like:

from lxml.html.clean import Cleaner

broken_html = '''
<html> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 
    <script type="text/javascript"> 
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.1',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script> 
    <script type="text/javascript" src="../_static/jquery.js"></script>
'''

cleaner = Cleaner(page_structure=False)
print cleaner.clean_html(broken_html)

... which produces the output:

<html><head><title>Regular Expression HOWTO — Python v2.7.1 documentation</title></head></html>

(N.B. I've changed nothtml in your example to html. With your original, method 1 works fine but wraps everything in <html><body>, whereas method 2 doesn't work, for reasons I don't have time to figure out right now :))

In order to remove HTML, style and script tags, you can use re.

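import re

def stripTags(text):
  # re.DOTALL lets '.' match newlines, so multi-line blocks are removed too
  scripts = re.compile(r'<(script).*?</\1>', re.DOTALL)
  css = re.compile(r'<style.*?</style>', re.DOTALL)
  tags = re.compile(r'<.*?>', re.DOTALL)

  # remove script and style blocks (with their contents) before bare tags
  text = scripts.sub('', text)
  text = css.sub('', text)
  text = tags.sub('', text)
  return text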

It works easily for me.
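A short sketch of where a pattern like the one above stops matching, in case it helps when deciding how far to trust it. The sample strings and the names SCRIPT_RE and TOLERANT_RE are made up for illustration, and the more tolerant variant is only one possible adjustment, not a complete answer to malformed HTML.

```python
import re

# Same idea as the scripts pattern above, compiled with DOTALL.
SCRIPT_RE = re.compile(r'<(script).*?</\1>', re.DOTALL)

# Hypothetical variant: ignore case and allow whitespace before the final '>'.
TOLERANT_RE = re.compile(r'<(script)\b.*?</\1\s*>', re.DOTALL | re.IGNORECASE)

plain = '<p>a</p><script>x()</script><p>b</p>'
upper = '<p>a</p><SCRIPT>x()</SCRIPT><p>b</p>'
spaced = '<p>a</p><script>x()</script ><p>b</p>'

for sample in (plain, upper, spaced):
    # Print what each pattern leaves behind for the same input.
    print(SCRIPT_RE.sub('', sample), '|', TOLERANT_RE.sub('', sample))
```

The strict pattern leaves the upper-case and spaced variants untouched, while the tolerant one removes all three. Neither one understands the document structure, which is the gap a real parser fills.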

The short answer is: don't do that. Use Beautiful Soup or ElementTree to get rid of them. Parse your data as HTML or XML. Regular expressions won't work and are the wrong answer to this problem.
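As a rough illustration of the parser-based route, here is a sketch for Python 3 that uses the standard library's html.parser. That is a swap made only to avoid third-party dependencies, not the Beautiful Soup or ElementTree approach named above. ScriptStripper and strip_scripts are hypothetical names, and the sketch silently drops comments and doctype declarations, so treat it as a starting point only.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Hypothetical helper: re-emits markup, skipping <script> elements."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &mdash; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True
        else:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if tag != 'script':
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False
        else:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        # Script bodies arrive here as raw text and are skipped.
        if not self.in_script:
            self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        self.parts.append('&#%s;' % name)


def strip_scripts(html):
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


page = ('<html><head><title>T</title>\n'
        '<script type="text/javascript">\n  var s = "<b>";\n</script>\n'
        '<script src="j.js"></script></head>'
        '<body><p>a &mdash; b</p><br/></body></html>')
cleaned = strip_scripts(page)
print(cleaned)
```

Because the parser treats the script body as raw text, markup-like strings inside the JavaScript (the "<b>" above) are not mistaken for real tags.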
