Parsing HTML with Python 2.7

Question

Evening folks (or morning depending on where you are :) ).

I'm looking to parse a webpage which contains multiple segments similar to the one below:

> <p><a name="Abercrombie"></a></p> <h3>Abercrombie Council</h3> <p>Mr
> Billy Smith<br />The Managing Director<br />123 Jones Street,
> London<br />T:02081234567<br /><a
> href="mailto:billysmith@example.com">Email</a></p>

What I want to do is capture the source of the webpage and parse it, extracting the unique info above into rows of a tab-delimited document, one record per line. Each row should split out the title, name of office, name of individual, job role, address, telephone number and email address.

I've been looking at using BeautifulSoup, but I'm wondering whether there are any other tools that are more suitable?

Answer 1

I'd say BeautifulSoup is your best and easiest option for parsing pages or chunks of HTML. You could also try Scrapy or even ScraperWiki.

Sample usage for BeautifulSoup:

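import urllib2

import BeautifulSoup  # BeautifulSoup 3, Python 2.x

# Download the page source
html = urllib2.urlopen('http://site.com').read()

# Parse it and collect every <p class="address">...</p> element
dom = BeautifulSoup.BeautifulSoup(html)
data = dom.findAll('p', {'class': 'address'})

for i in data:
    print i  # print each matching element, not the whole result list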

More examples: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
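
For the tab-delimited output the question asks for, here is a rough sketch of one way to turn the sample segment into a row. To keep it dependency-free it uses the standard library's HTMLParser rather than BeautifulSoup, and it should run on both Python 2.7 and 3. The names ContactParser and to_row are made up for this illustration. It also assumes every record looks like the sample: an <h3> with the office name, followed by a <p> whose <br />-separated lines are name, job role, address and phone, where the first word of the name line is the title and the phone starts with "T:". Real pages may well need adjusting.

```python
from __future__ import print_function

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7


class ContactParser(HTMLParser):
    """Hypothetical helper: one record per <h3> and the <p> that follows it."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.records = []        # finished records, one dict per office
        self.current = None      # record currently being filled in
        self.in_h3 = False
        self.in_details = False  # inside the <p> that follows an <h3>
        self.in_link = False     # inside an <a> within that <p>

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self.current = {'office': '', 'lines': [''], 'email': ''}
            self.in_h3 = True
        elif self.current is None:
            return               # ignore everything before the first <h3>
        elif tag == 'p':
            self.in_details = True
        elif tag == 'br' and self.in_details:
            self.current['lines'].append('')   # <br /> starts a new line
        elif tag == 'a' and self.in_details:
            self.in_link = True
            href = dict(attrs).get('href') or ''
            if href.startswith('mailto:'):
                self.current['email'] = href[len('mailto:'):]

    def handle_endtag(self, tag):
        if tag == 'h3':
            self.in_h3 = False
        elif tag == 'a':
            self.in_link = False
        elif tag == 'p' and self.in_details:
            self.in_details = False
            self.records.append(self.current)
            self.current = None

    def handle_data(self, data):
        if self.in_h3:
            self.current['office'] += data
        elif self.in_details and not self.in_link:
            self.current['lines'][-1] += data   # link text is skipped


def to_row(record):
    """Build one tab-delimited row (assumed layout: name, role, address, phone)."""
    lines = [' '.join(line.split()) for line in record['lines']]
    lines = [line for line in lines if line]
    name_line, role, address, phone = (lines + [''] * 4)[:4]
    title, _, person = name_line.partition(' ')  # assumes "Mr Billy Smith"
    if phone.startswith('T:'):
        phone = phone[2:]
    office = ' '.join(record['office'].split())
    return '\t'.join([title, office, person, role, address, phone,
                      record['email']])


SAMPLE = ('<p><a name="Abercrombie"></a></p> <h3>Abercrombie Council</h3> '
          '<p>Mr Billy Smith<br />The Managing Director<br />'
          '123 Jones Street, London<br />T:02081234567<br />'
          '<a href="mailto:billysmith@example.com">Email</a></p>')

parser = ContactParser()
parser.feed(SAMPLE)
parser.close()
rows = [to_row(record) for record in parser.records]
for row in rows:
    print(row)
```

To write a file instead of printing, join the rows with newlines and write them out with a trailing newline. The same record-building logic should carry over to BeautifulSoup or lxml if you prefer one of those for the parsing step.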

Answer 2

BeautifulSoup is a good, popular library, but you could also take a look at lxml.

Answer 3

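The web scraping framework Scrapy (http://scrapy.org/) is a good choice for this kind of task, because it can not only parse and extract data but also run automated crawl jobs.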

3 answers

solution1
1 2013-01-24 21:15:41

solution2
0 2013-01-24 21:10:16

solution3
0 2013-01-24 22:27:17
