
Remove html tags using Regex


I'm trying to get rid of the HTML tags. To an extent it works, but not all of the tags are removed: the tags mentioned below are still there.

import lxml.html

def remove_tags(content):
    # Parse the HTML, dropping comments, and return only the text content.
    parser = lxml.html.HTMLParser(remove_comments=True,
                                  remove_blank_text=True)
    document = lxml.html.document_fromstring(content, parser)
    return document.text_content()

print('NOT DEALT WITH:')
for body in not_dealt_with_list:
    #p = re.compile(r'<.*?[\\t\\n\\r\\s]*?.*?>')
    print(remove_tags(body))
    #print(p.sub('', body))
    #body = re.sub()

It looks like what you're trying to remove is embedded in an HTML comment (because it doesn't look like HTML there). HTML comments start with <!-- and end with -->, and that's what you have to search for.

Try this regex to match everything inside a comment, across multiple lines, so that you can replace it afterwards:

<!--(.|\n)*?-->
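
As a minimal sketch of how that pattern could be applied in Python: the helper name strip_comments and the sample string are made up for illustration and are not from the question. It uses re.DOTALL so that "." also matches newlines, which should behave the same as the (.|\n) alternation above.

```python
import re

# Non-greedy match from "<!--" to the nearest "-->".
# re.DOTALL lets "." match newlines, so multi-line comments are covered.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_comments(html):
    """Return html with every <!-- ... --> block removed (hypothetical helper)."""
    return COMMENT_RE.sub("", html)

# Made-up sample input with a multi-line comment between two paragraphs.
sample = "<p>Hello</p><!-- [if gte mso 9]>\n<xml>junk</xml>\n<![endif]--><p>World</p>"
print(strip_comments(sample))  # prints: <p>Hello</p><p>World</p>
```

Running this on each body before passing it to remove_tags might be enough to get rid of the leftover blocks, assuming they really are inside comments.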

Let me know how it works out!

