Removing html tags from a text using Regular Expression in python

I'm trying to read an HTML file and remove all the tags from it so that only the text is left, but I'm having a problem with my regex. This is what I have so far.

import urllib.request, re
def test(url):
    html = str(urllib.request.urlopen(url).read())
    print(re.findall('<[\w\/\.\w]*>',html))

The HTML is a simple page with a few links and some text, but my regex won't pick up the !DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" declaration or the 'a href="...."' tags. Can anyone explain what I need to change in my regex?

Use BeautifulSoup. Use lxml. Do not use regular expressions to parse HTML.
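To illustrate what a parser-based approach looks like without installing anything, here is a minimal sketch that uses the standard library's html.parser as a stand-in for BeautifulSoup or lxml. The names TextExtractor and strip_tags and the sample markup are made up for this example; it is not the code either library uses internally.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse the runs of whitespace left behind by removed tags
    return " ".join("".join(parser.parts).split())


# Hypothetical sample page with a doctype, a style block and a link
sample = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'
          '<html><head><style>p {color: red}</style></head>'
          '<body><p>Some text and <a href="page.html">a link</a>.</p></body></html>')
print(strip_tags(sample))
```

The doctype and the tag with attributes need no special handling here because the parser tokenizes them itself. For real-world, badly formed pages, BeautifulSoup or lxml will likely be more forgiving than this sketch.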


Edit 2010-01-29: This would be a reasonable starting point for lxml:

from lxml.html import fromstring
from lxml.html.clean import Cleaner
import requests

url = "https://stackoverflow.com/questions/2165943/removing-html-tags-from-a-text-using-regular-expression-in-python"
html = requests.get(url).text

doc = fromstring(html)

tags = ['h1','h2','h3','h4','h5','h6',
       'div', 'span', 
       'img', 'area', 'map']
args = {'meta':False, 'safe_attrs_only':False, 'page_structure':False, 
       'scripts':True, 'style':True, 'links':True, 'remove_tags':tags}
cleaner = Cleaner(**args)

path = '/html/body'
body = doc.xpath(path)[0]

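# Python 3: print is a function; non-ASCII characters are dropped, as in the original
print(cleaner.clean_html(body).text_content().encode('ascii', 'ignore').decode('ascii'))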

You want the content, so presumably you don't want any JavaScript or CSS. Presumably you also want only the content of the body, not HTML from the head. Read up on lxml.html.clean to see what you can easily strip out. Way smarter than regular expressions, no?

Also, watch out for Unicode encoding problems. You can easily end up with HTML that you cannot print.
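One way to guard against those encoding problems is to decode the raw bytes explicitly and to escape anything the output stream cannot represent. The sketch below is only an illustration: decode_html and printable are hypothetical helper names, and falling back to UTF-8 is an assumption, not something the page guarantees.

```python
def decode_html(raw, declared_charset=None):
    """Decode response bytes, trying the declared charset first, then UTF-8."""
    for encoding in (declared_charset, "utf-8"):
        if encoding is None:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD
    return raw.decode("utf-8", errors="replace")


def printable(text, encoding="ascii"):
    """Return text that can be written to a stream using the given encoding."""
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


raw = "café".encode("latin-1")  # bytes that are not valid UTF-8
text = decode_html(raw, declared_charset="latin-1")
print(printable(text))
```

With urllib.request, the declared charset can usually be read from the response via response.headers.get_content_charset(). Note that calling str() on the bytes returned by read(), as the question does, produces the b'...' representation of the bytes and does not decode them.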


2012-11-08: changed from using urllib2 to requests. Just use requests!

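import re
import urllib.request

# Match any tag (non-greedy) as well as the &nbsp; and &amp; entities
patjunk = re.compile(r"<.*?>|&nbsp;|&amp;", re.DOTALL | re.M)
url = "http://www.yahoo.com"

def test(url, pat):
    # Decode the response bytes to text before substituting
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    return pat.sub("", html)

print(test(url, patjunk))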
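To see why the pattern in the question misses some tags while the pattern in this answer does not, both can be run against a small string. The sample markup below is made up for illustration. The last line uses html.unescape from the standard library as an alternative to deleting entities, which is a substitution on my part, not what the answer's code does.

```python
import html
import re

# Hypothetical sample with a doctype, an entity and a tag that has an attribute
sample = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'
          '<html><body><p>Fish &amp; chips</p> '
          '<a href="page.html">a link</a></body></html>')

# Pattern from the question: the class allows only word characters, "/" and ".",
# so a tag containing a space, "!", "=" or a quote can never match.
question_pat = re.compile(r'<[\w\/\.\w]*>')

# Tag part of the pattern from this answer: "<", as few characters as possible, ">".
answer_pat = re.compile(r'<.*?>', re.DOTALL)

question_tags = question_pat.findall(sample)
answer_tags = answer_pat.findall(sample)

# Strip the tags, then convert entities to characters instead of deleting them
text = html.unescape(answer_pat.sub('', sample))

print(question_tags)
print(answer_tags)
print(text)
```

Deleting &amp; outright, as patjunk does, also drops the ampersand from the visible text, so unescaping is usually closer to what a reader of the page would see. The non-greedy pattern still breaks on a ">" inside an attribute value or a script, which is one reason the other answer recommends a real parser.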
