
Parsing HTML with Python 2.7

Evening folks (or morning depending on where you are :) ).

I'm looking to parse a webpage which contains multiple segments similar to the below:-

> <p><a name="Abercrombie"></a></p> <h3>Abercrombie Council</h3> <p>Mr
> Billy Smith<br />The Managing Director<br />123 Jones Street,
> London<br />T:02081234567<br /><a
> href="mailto:billysmith@example.com">Email</a></p>

What I want to do is capture the page source, parse it to extract the unique info above, and write it out as rows in a tab-delimited document (one row per segment, with a newline at the end), split into: title, name of office, name of individual, job role, address, telephone number, email address.

I've been looking at using BeautifulSoup, but I'm wondering if there are any other tools that are more suitable?

I'd say BeautifulSoup would be your best and easiest option for parsing pages or chunks of HTML. You can also try Scrapy or even ScraperWiki.

Sample usage for BeautifulSoup:

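import BeautifulSoup
import urllib2

# Download the page source (swap in the real URL)
get = urllib2.urlopen('http://site.com').read()
dom = BeautifulSoup.BeautifulSoup(get)
# Find every matching element, e.g. <p class='address'>....</p>
data = dom.findAll('p', {'class' : 'address'})

for i in data:
    print i  # print each match, not the whole list every time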

More examples: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
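
To tie this back to the fragment in the question (which has no `class="address"` paragraph), here's a rough sketch of the whole extract-and-output step. It uses the standard library's `HTMLParser` so it runs without installing anything; that's a swap from BeautifulSoup, not what the answer above proposes. `CouncilParser`, `to_row` and `SAMPLE` are made-up names, and the field layout (person first, role second, a trailing `T:` line as the phone, anything in between as the address) is an assumption drawn from the single example in the question, so real pages may need adjusting.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7


class CouncilParser(HTMLParser):
    """Collect one record per <h3> heading plus the <p> block after it."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.records = []      # finished records, in page order
        self._current = None   # record being filled in, if any
        self._mode = None      # 'h3', 'p', or None
        self._in_link = False
        self._buf = []

    def _flush(self):
        # Collapse whitespace and store the buffered text as one line.
        text = ' '.join(''.join(self._buf).split())
        self._buf = []
        if text:
            self._current['lines'].append(text)

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self._current = {'office': '', 'lines': [], 'email': ''}
            self._mode = 'h3'
            self._buf = []
        elif tag == 'p' and self._current is not None and self._mode is None:
            self._mode = 'p'
            self._buf = []
        elif self._mode == 'p' and tag == 'br':
            self._flush()          # each <br /> ends one line of the block
        elif self._mode == 'p' and tag == 'a':
            self._in_link = True   # skip the link text ("Email")
            href = dict(attrs).get('href') or ''
            if href.startswith('mailto:'):
                self._current['email'] = href[len('mailto:'):]

    def handle_endtag(self, tag):
        if tag == 'h3' and self._mode == 'h3':
            self._current['office'] = ' '.join(''.join(self._buf).split())
            self._buf = []
            self._mode = None
        elif tag == 'a':
            self._in_link = False
        elif tag == 'p' and self._mode == 'p':
            self._flush()
            self.records.append(self._current)
            self._current = None
            self._mode = None

    def handle_data(self, data):
        if self._mode == 'h3' or (self._mode == 'p' and not self._in_link):
            self._buf.append(data)


def to_row(record):
    """Assumed layout: person, role, address line(s), optional 'T:' phone."""
    lines = list(record['lines'])
    phone = ''
    if lines and lines[-1].startswith('T:'):
        phone = lines.pop()[2:].strip()
    person = lines[0] if lines else ''
    role = lines[1] if len(lines) > 1 else ''
    address = ', '.join(lines[2:])
    title, _, name = person.partition(' ')  # "Mr Billy Smith" -> "Mr", "Billy Smith"
    return [title, record['office'], name, role, address, phone,
            record['email']]


# The fragment from the question, joined onto fewer lines.
SAMPLE = (
    '<p><a name="Abercrombie"></a></p> <h3>Abercrombie Council</h3> '
    '<p>Mr Billy Smith<br />The Managing Director<br />123 Jones Street, '
    'London<br />T:02081234567<br />'
    '<a href="mailto:billysmith@example.com">Email</a></p>'
)

parser = CouncilParser()
parser.feed(SAMPLE)
parser.close()
rows = [to_row(r) for r in parser.records]
for row in rows:
    print('\t'.join(row))  # one tab-delimited line per office
```

Writing `'\t'.join(row) + '\n'` to a file instead of printing should give the tab-delimited document the question asks for. With BeautifulSoup the field logic in `to_row` would stay the same; only the tree walking changes.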

BeautifulSoup is a good, popular library, but you can also take a look at lxml.

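The web scraping framework Scrapy (http://scrapy.org/) is a good choice for this kind of task, because it can not only parse and extract data but also run automated crawling jobs.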
