
Extracting readable text from HTML using Python?

I know about utilities like html2text and BeautifulSoup, but the issue is that they also extract JavaScript and add it to the text, making it tough to separate the two.

htmlDom = BeautifulSoup(webPage)

htmlDom.findAll(text=True)

Alternatively,

from stripogram import html2text
extract = html2text(webPage)

Both of these extract all the JavaScript on the page as well, which is undesired.

I just want to extract the readable text, i.e. what you could copy from your browser.

If you want to avoid extracting any of the contents of script tags with BeautifulSoup,

nonscripttags = htmlDom.findAll(lambda t: t.name != 'script', recursive=False)

will do that for you, getting the root's immediate children which are non-script tags (and a separate htmlDom.findAll(recursive=False, text=True) will get strings that are immediate children of the root). You need to do this recursively; e.g., as a generator:

def nonScript(tag):
    return tag.name != 'script'

def getStrings(root):
    for s in root.childGenerator():
        if hasattr(s, 'name'):      # then it's a tag
            if s.name == 'script':  # skip it!
                continue
            for x in getStrings(s):
                yield x
        else:                       # it's a string!
            yield s

I'm using childGenerator (in lieu of findAll) so that I can just get all the children in order and do my own filtering.
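For what it's worth, here is a rough sketch of the same recursive idea against the newer bs4 package, assuming Python 3. The names get_strings and html are made up for illustration, and the isinstance checks are my substitution for the hasattr test above, so treat it as one possible adaptation rather than the answer's own code.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def get_strings(root):
    """Yield text nodes under root in document order, skipping <script>."""
    for child in root.children:
        if isinstance(child, Tag):
            if child.name == "script":  # skip the tag and everything inside it
                continue
            yield from get_strings(child)
        elif isinstance(child, NavigableString):
            yield str(child)


# Hypothetical sample page, only for demonstration.
html = "<html><body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
soup = BeautifulSoup(html, "html.parser")
text = "".join(get_strings(soup))
print(text)
```

Note that this still yields comments and other non-visible string nodes, and it does nothing about style tags; those would need extra filtering if they matter for your pages.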

Using BeautifulSoup, something along these lines:

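# Python 3 version of the snippet (the original used `unicode` and the print statement)
def _extract_text(t):
    if not t:
        return ""
    if isinstance(t, str):  # text node
        # collapse newlines and runs of spaces into single spaces
        return " ".join(filter(None, t.replace("\n", " ").split(" ")))
    if t.name.lower() == "br":
        return "\n"
    if t.name.lower() == "script":  # drop script contents
        return "\n"
    return "".join(extract_text(c) for c in t)

def extract_text(t):
    # strip leading and trailing whitespace on every line
    return '\n'.join(x.strip() for x in _extract_text(t).split('\n'))

print(extract_text(htmlDom))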

You can remove script tags in BeautifulSoup with something like:

for script in soup("script"):
    script.extract()

See "Removing Elements" in the BeautifulSoup documentation.
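A minimal end-to-end sketch of this approach, assuming the bs4 package on Python 3. The sample html string and the choice of get_text with a space separator are my own illustration, not part of the answer:

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, only for demonstration.
html = """<html><head><title>Demo</title>
<script>var tracker = 1;</script></head>
<body><p>Readable text.</p><script>alert("hi");</script></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# soup("script") is shorthand for soup.find_all("script"); it returns a
# list, so detaching tags while looping over it is safe.
for script in soup("script"):
    script.extract()  # remove the tag and its contents from the tree

# Join the remaining text nodes, stripping whitespace around each one.
text = soup.get_text(separator=" ", strip=True)
print(text)
```

The same loop can be repeated for style tags if inline CSS also shows up in the output.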

