Extract text from HTML faster than NLTK?

Question

We use NLTK to extract text from HTML pages, but we only need the most trivial text analysis, e.g. a word count.

Is there a faster way to extract visible text from HTML using Python?

Some minimal understanding of HTML (and ideally CSS), such as telling visible nodes from invisible ones or picking up images' alt text, would be a great bonus.

Answer 1

I ran into the same problem at my previous workplace. You'll want to check out Beautiful Soup.

from bs4 import BeautifulSoup

# html is a string holding the page source
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())

You'll find its documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

You can ignore elements based on their attributes. As for understanding external stylesheets, I'm not too sure. One option that might not be too slow (depending on the page) is to render the page with something like PhantomJS and then select the rendered text :)
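To make the "ignore elements based on attributes" idea concrete, here is a rough sketch that uses only the standard library's html.parser rather than Beautiful Soup. The names VisibleTextParser and visible_text, and the tag sets, are my own and purely illustrative. It assumes reasonably well-nested markup, looks only at the hidden attribute and an inline display:none style, and knows nothing about external stylesheets.

```python
from html.parser import HTMLParser

# Assumption: tags whose contents are never shown as page text.
INVISIBLE_TAGS = {"script", "style", "head", "template", "noscript"}
# Void elements have no closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibleTextParser(HTMLParser):
    """Hypothetical helper: collects text outside invisible subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.hidden_depth = 0  # > 0 while inside an invisible subtree

    @staticmethod
    def _is_hidden(tag, attrs):
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        return (tag in INVISIBLE_TAGS
                or "hidden" in attrs
                or "display:none" in style)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Keep the alt text of visible images.
            if tag == "img" and not self.hidden_depth:
                alt = dict(attrs).get("alt")
                if alt:
                    self.chunks.append(alt)
            return
        # Once inside a hidden subtree, every nested element deepens it.
        if self.hidden_depth or self._is_hidden(tag, attrs):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.hidden_depth:
            self.chunks.append(text)


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A word count is then just len(visible_text(html).split()). Text chunks are joined with spaces, so words split across inline tags will be counted separately, and unclosed tags inside a hidden element can throw off the depth tracking. Treat it as a starting point, not a renderer.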
