Python: removing HTML tags with a regular expression

Question

This is usually not a hard task, but today I can't seem to remove a simple JavaScript tag.

The example I'm working with (formatted):

<section class="realestate oca"></section>
<script type="text/javascript" data-type="ad">
    window.addEventListener('DOMContentLoaded', function(){
        window.postscribe && postscribe(document.querySelector(".realestate"),
        '<script src="https://ocacache-front.schibsted.tech/public/dist/oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\/script>');
    });
</script>

The example I'm working with (raw):

<section class="realestate oca"></section>\n<script type="text/javascript" data-type="ad">\n\twindow.addEventListener(\'DOMContentLoaded\', function(){\n\t\twindow.postscribe && postscribe(document.querySelector(".realestate"),\n\t\t\'<script src="https://ocacache-front.schibsted.tech/public/dist/oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\\/script>\');\n\t});\n</script>

I would like to remove everything from <script (beginning of the second line) to </script> (last line), so that only the first line, <section..>, is output.

Here's my line of code:

re.sub(r'<script[^</script>]+</script>', '', text)
#or
re.sub(r'<script.+?</script>', '', text)

I'm clearly missing something, but I can't see what.
Note: The document I'm working with contains mainly plain text, so no parsing with lxml or similar is needed.

Answer 1

Your first regex didn't work because a character class ([...]) is a collection of characters, not a string. So it will only match if it finds <script separated from </script> by a run of characters that doesn't include any of <, /, s, c, etc.
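
To make the character-class behaviour concrete, here is a minimal sketch. The two sample strings are invented for illustration and are not taken from the question.

```python
import re

# [^</script>] is a negated character class: it matches ONE character that is
# not any of '<', '/', 's', 'c', 'r', 'i', 'p', 't', '>'.
naive = re.compile(r'<script[^</script>]+</script>')

# Hypothetical input whose body avoids every excluded character: it matches.
matches_digits = naive.search('<script 12345 </script>') is not None

# Hypothetical input where '<script' is followed directly by '>', which is
# one of the excluded characters: the class cannot match even once.
matches_alert = naive.search('<script>alert(1)</script>') is not None

print(matches_digits)  # True
print(matches_alert)   # False
```

Any realistic script body contains at least one of those nine characters, which is why the first pattern fails on the example in the question.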

Your second regex is better, and the only reason it's not working is that, by default, the . wildcard does not match newlines. To tell it that you want it to, you'll need to add the DOTALL flag:

re.sub(r'<script.+?</script>', '', text, flags=re.DOTALL)
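
As a quick check, the sketch below rebuilds the raw string from the question and runs the second pattern with and without the flag. The variable names are my own, and the inline (?s) form is shown only as an equivalent spelling of the same flag.

```python
import re

# The raw example from the question, rebuilt as a Python string.
text = (
    '<section class="realestate oca"></section>\n'
    '<script type="text/javascript" data-type="ad">\n'
    "\twindow.addEventListener('DOMContentLoaded', function(){\n"
    '\t\twindow.postscribe && postscribe(document.querySelector(".realestate"),\n'
    '\t\t\'<script src="https://ocacache-front.schibsted.tech/public/dist/'
    'oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\\/script>\');\n'
    '\t});\n'
    '</script>'
)

pattern = r'<script.+?</script>'

# Without DOTALL, '.' stops at every newline, so nothing is removed.
without_flag = re.sub(pattern, '', text)

# With DOTALL, '.' also matches '\n', so the whole block is removed.
with_flag = re.sub(pattern, '', text, flags=re.DOTALL)

# Equivalent spelling using the inline flag at the start of the pattern.
with_inline = re.sub(r'(?s)' + pattern, '', text)

print(without_flag == text)  # True
print(repr(with_flag))       # '<section class="realestate oca"></section>\n'
```

Pass the flag by keyword: the fourth positional parameter of re.sub is count, so re.sub(pattern, '', text, re.DOTALL) would silently treat the flag's integer value as a replacement limit instead. The escaped <\/script> inside the JavaScript string is not matched by </script>, which is why the lazy match runs to the real closing tag.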

Answer 1 (3, accepted, 2017-02-13 14:28:57)
