
Extract text from HTML faster than NLTK?

We use NLTK to extract text from HTML pages, but we only need the most trivial text analysis, e.g. word count.

Is there a faster way to extract visible text from HTML using Python?

It would be a bonus if the solution understood HTML (and ideally CSS) at some minimal level, such as visible vs. invisible nodes, images' alt texts, etc.

I ran into the same problem at my previous workplace. You'll want to check out BeautifulSoup.

from bs4 import BeautifulSoup
# html is a string holding the page source
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())

You'll find its documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

You can ignore elements based on their attributes. As for understanding external stylesheets, I'm not too sure. However, one option that would not be too slow (depending on the page) is to render the page with something like PhantomJS and then select the rendered text :)
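
To make the "ignore elements based on attributes" idea concrete, here is a rough sketch of how it could look with BeautifulSoup. The helper names (`visible_text`, `_hidden_by_attributes`) and the `INVISIBLE_TAGS` list are my own, not part of the library, and the visibility check is only a heuristic: it looks at the `hidden` attribute and inline `display:none` styles, and it knows nothing about external stylesheets.

```python
from bs4 import BeautifulSoup

# Assumption: tags whose contents should never count as page text.
INVISIBLE_TAGS = ["script", "style", "head", "template", "noscript"]


def _hidden_by_attributes(tag):
    """Heuristic: `hidden` attribute or an inline `display:none` style."""
    style = (tag.get("style") or "").replace(" ", "").lower()
    return tag.has_attr("hidden") or "display:none" in style


def visible_text(html):
    """Return roughly the visible text of an HTML string, whitespace-normalized."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove non-rendered tags together with their contents.
    for tag in soup.find_all(INVISIBLE_TAGS):
        tag.extract()

    # Remove elements hidden via attributes (inline styles only).
    for tag in soup.find_all(_hidden_by_attributes):
        tag.extract()

    # Swap each image for its alt text so it shows up in the output.
    for img in soup.find_all("img"):
        img.replace_with(img.get("alt") or "")

    # Join text nodes with spaces, then collapse runs of whitespace.
    return " ".join(soup.get_text(separator=" ").split())


if __name__ == "__main__":
    page = """
    <html><head><title>T</title><style>p {color: red}</style></head>
    <body><p>Hello <b>world</b></p><script>var x = 1;</script>
    <div style="display: none">secret</div><span hidden>gone</span>
    <img src="a.png" alt="A logo"></body></html>
    """
    text = visible_text(page)
    print(text)
    print(len(text.split()), "words")
```

One trade-off to be aware of: joining text nodes with a space keeps words in adjacent block elements apart, but it also splits a word that is partly wrapped in an inline tag (`wor<b>ld</b>`), so word counts are approximate.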


 