
Python Web-Scraping using Beautiful Soup on a messy Site

I want to scrape the following three data points from this site: %verified, the numerical value for FAR, and the numerical value for POD. I'm trying to do this with BeautifulSoup, but I'm not practiced in site traversal, so I can't describe the location of those elements.

What is the easiest way to go about doing this?

If you haven't yet, install Firebug for Firefox and use it to inspect the HTML source of the page.

Use a combination of urllib and BeautifulSoup to handle HTML retrieval and parsing. Here is a short example:

import urllib
from BeautifulSoup import BeautifulSoup

url = 'http://mesonet.agron.iastate.edu/cow/?syear=2009&smonth=9&sday=12&shour=12&eyear=2012&emonth=9&eday=12&ehour=12&wfo=ABQ&wtype[]=TO&hail=1.00&lsrbuffer=15&ltype[]=T&wind=58'
fp = urllib.urlopen(url).read()  # fetch the raw HTML
soup = BeautifulSoup(fp)         # parse it into a navigable tree

print soup

From here, the links I provided should give you a good start on how to retrieve the elements you're interested in.
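As a sketch of that retrieval step, here is a modern bs4 equivalent run against a hypothetical fragment standing in for the real page's markup (the actual table structure may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the real page's markup
sample = """
<table>
  <tr><th>FAR:</th><th><span>0.67</span></th></tr>
  <tr><th>POD:</th><th><span>0.58</span></th></tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")

# Pull every highlighted <span> value out of the table
values = [float(s.text) for s in soup.find_all("span")]
print(values)  # [0.67, 0.58]
```

The same `find_all` call works on the real page once you know which tags wrap the numbers you want.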

As That1Guy says, you need to analyse the source page structure. In this case, you're lucky: the numbers you are looking for are specifically highlighted in red using <span>.

This will do it:

>>> import urllib2
>>> import lxml.html.soupparser
>>> url = ... # put your URL here
>>> html = urllib2.urlopen(url).read()
>>> soup = lxml.html.soupparser.fromstring(html)
>>> elements = soup.xpath('//th/span')
>>> print float(elements[0].text) # FAR
0.67
>>> print float(elements[1].text) # POD
0.58

Note lxml.html.soupparser is pretty much equivalent to the BeautifulSoup parser (which I don't have to hand at the moment).
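The same XPath extraction also works with plain lxml, using `lxml.html.fromstring` so no BeautifulSoup install is needed. A sketch against a hypothetical stand-in for the page's summary row (the real markup may differ):

```python
import lxml.html

# Hypothetical stand-in for the page's summary row
sample = "<table><tr><th>FAR: <span>0.67</span></th><th>POD: <span>0.58</span></th></tr></table>"

root = lxml.html.fromstring(sample)
elements = root.xpath('//th/span')  # same XPath as in the answer above
far = float(elements[0].text)
pod = float(elements[1].text)
print(far, pod)  # 0.67 0.58
```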

I ended up solving it myself. I was using a strategy similar to isedev's, but I was hoping to find a better way of getting the 'Verified' data:

import urllib2
from bs4 import BeautifulSoup

def main():
    # Read the list of WFO identifiers, skipping blank lines
    wfo = [i.strip() for i in open(r'C:\Python27\wfo.txt') if i[:-1]]
    soup = BeautifulSoup(urllib2.urlopen('http://mesonet.agron.iastate.edu/cow/?syear=2009&smonth=9&sday=12&shour=12&eyear=2012&emonth=9&eday=12&ehour=12&wfo=ABQ&wtype%5B%5D=TO&hail=1.00&lsrbuffer=15&ltype%5B%5D=T&wind=58').read())
    elements = soup.find_all("span")
    find_verify = soup.find_all('th')

    far = float(elements[1].text)         # False Alarm Ratio
    pod = float(elements[2].text)         # Probability of Detection
    verified = find_verify[13].text[:-1]  # %verified, trailing '%' stripped

if __name__ == '__main__':
    main()
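One way to make the 'Verified' lookup less brittle than the hard-coded `find_verify[13]` index is to search by the label text and read the neighbouring cell. A sketch, where the label string and table layout are assumptions rather than taken from the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical layout: label in one <th>, value in the next
sample = """
<table>
  <tr><th>Percentage of Warnings Verified:</th><th>66%</th></tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")

def value_after_label(soup, label):
    # Find the <th> whose text contains the label, then read the next cell
    th = soup.find("th", string=lambda s: s and label in s)
    return th.find_next("th").get_text(strip=True)

verified = value_after_label(soup, "Verified").rstrip("%")
print(verified)  # 66
```

This survives rows being added or reordered, as long as the label text stays the same.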
