
Parsing HTML with Python 2.7 - HTMLParser, SGMLParser, or Beautiful Soup?

Original 2011-06-27 14:11:55 7 4 python/ html/ parsing/ beautifulsoup/ html-parsing

I want to do some screen-scraping with Python 2.7, and I don't understand the differences between HTMLParser, SGMLParser, and Beautiful Soup.

Are these all trying to solve the same problem, or do they exist for different reasons? Which is simplest, which is most robust, and which (if any) is the default choice?

Also, please let me know if I have overlooked a significant option.

Edit: I should mention that I'm not very experienced in HTML parsing, and I'm mainly interested in which option will get me moving the quickest, with the goal of parsing HTML on one particular site.

4 answers

I use and would recommend lxml and pyquery for parsing HTML. I had to write a web-scraping bot a few months ago, and of all the popular alternatives I tried, including HTMLParser and BeautifulSoup, I went with lxml and the syntactic sugar of pyquery. I haven't tried SGMLParser, though.

From what I've seen, lxml is more or less the most feature-rich library, and its underlying C core is quite performant compared to the alternatives. As for pyquery, I really liked its jQuery-inspired syntax, which makes navigating the DOM more enjoyable.
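To give a feel for the lxml side of this, here is a minimal sketch that pulls links out of a page fragment with XPath. The markup in `PAGE`, the `post` class name, and the helper `extract_links` are all invented for illustration, and the code is written for a current Python 3 install of lxml rather than the 2.7 setup discussed in the question.

```python
import lxml.html

# Hypothetical page fragment, invented for this example.
PAGE = """
<html><body>
  <div class="post"><a href="/a">First</a></div>
  <div class="post"><a href="/b">Second</a></div>
</body></html>
"""


def extract_links(html_text):
    """Return (text, href) pairs for links directly inside div.post elements."""
    tree = lxml.html.fromstring(html_text)
    links = []
    # XPath: every <a> that is a direct child of a <div class="post">.
    for anchor in tree.xpath('//div[@class="post"]/a'):
        links.append((anchor.text_content().strip(), anchor.get("href")))
    return links


links = extract_links(PAGE)
print(links)
```

pyquery wraps the same lxml tree in a jQuery-style selector API, so a query like the XPath above would be written as a CSS selector instead.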

Here are some resources you might find useful in case you decide to give it a try:

lxml home page
pyquery home page
BeautifulSoup vs lxml benchmark
Win installer for pyquery built against Python 2.7 - I had a hard time setting up pyquery :)

Well, that's my 2c :) I hope this helps.

BeautifulSoup in particular is for dirty HTML as found in the wild. It will parse any old thing, but is slow.

A very popular choice these days is lxml.html, which is fast, and can use BeautifulSoup if needed.
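As a rough illustration of the "dirty HTML" point, the sketch below feeds Beautiful Soup 4 a fragment with an unquoted attribute, upper-case tag and attribute names, and unclosed tags. The markup is made up for the example, and choosing the standard-library `"html.parser"` backend is an assumption made to keep the example dependency-light; `"lxml"` can be passed instead if it is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed markup: unquoted attribute value, upper-case
# tag/attribute names, and <p> / <a> tags that are never closed.
DIRTY = "<p>Links: <a href=/one>One</a> and <A HREF='/two'>Two"

soup = BeautifulSoup(DIRTY, "html.parser")

# Collect (text, href) for every anchor the parser recovered.
anchors = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]
print(anchors)
```

Different backends repair broken markup in different ways, so the resulting tree can vary between `"html.parser"` and `"lxml"` on messier input than this.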

Take a look at Scrapy. It is a Python framework built specifically for scraping. It makes it very easy to extract information using the XPath to an element. It also has some very interesting capabilities, such as defining models for the scraped data (so it can be exported in different formats), authentication, and recursively following links.

Well, software is like cars: different flavors, but they all drive!

Go with BeautifulSoup (4).


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


4 answers

solution1 14 ACCEPTED 2011-06-27 14:56:07

solution2 6 2011-06-27 14:32:33

solution3 1 2013-11-11 03:50:02

solution4 -4 2011-06-27 14:18:32
