
Python Web-Scraping using Beautiful Soup on a messy Site

I want to scrape the following three data points from this site: %verified, the numerical value for FAR, and the numerical value for POD. I'm trying to do this with BeautifulSoup, but I'm not practiced in site traversal, so I can't describe the location of those elements.

What is the easiest way to go about doing this?

If you haven't yet, install Firebug for Firefox and use it to inspect the HTML source of the page.

Use a combination of urllib and BeautifulSoup to handle HTML retrieval and parsing. Here is a short example:

import urllib
from BeautifulSoup import BeautifulSoup

url = 'http://mesonet.agron.iastate.edu/cow/?syear=2009&smonth=9&sday=12&shour=12&eyear=2012&emonth=9&eday=12&ehour=12&wfo=ABQ&wtype[]=TO&hail=1.00&lsrbuffer=15&ltype[]=T&wind=58'
fp = urllib.urlopen(url).read()  # fetch the raw HTML
soup = BeautifulSoup(fp)         # parse it into a navigable tree

print soup

From here, the links I provided should give you a good start on how to retrieve the elements you're interested in.
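As a sketch of that retrieval step, here is a modern bs4 equivalent run against a hypothetical fragment standing in for the real page's markup (the actual table structure may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the real page's markup
sample = """
<table>
  <tr><th>FAR:</th><th><span>0.67</span></th></tr>
  <tr><th>POD:</th><th><span>0.58</span></th></tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")

# Pull every highlighted <span> value out of the table
values = [float(s.text) for s in soup.find_all("span")]
print(values)  # [0.67, 0.58]
```

The same `find_all` call works on the real page once you know which tags wrap the numbers you want.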

As That1Guy says, you need to analyse the source page structure. In this case, you're lucky: the numbers you are looking for are specifically highlighted in red using <span>.

This will do it:

>>> import urllib2
>>> import lxml.html.soupparser
>>> url = ... # put your URL here
>>> html = urllib2.urlopen(url).read()
>>> soup = lxml.html.soupparser.fromstring(html)
>>> elements = soup.xpath('//th/span')
>>> print float(elements[0].text) # FAR
0.67
>>> print float(elements[1].text) # POD
0.58

Note lxml.html.soupparser is pretty much equivalent to the BeautifulSoup parser (which I don't have to hand at the moment).
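The same XPath extraction also works with plain lxml, using `lxml.html.fromstring` so no BeautifulSoup install is needed. A sketch against a hypothetical stand-in for the page's summary row (the real markup may differ):

```python
import lxml.html

# Hypothetical stand-in for the page's summary row
sample = "<table><tr><th>FAR: <span>0.67</span></th><th>POD: <span>0.58</span></th></tr></table>"

root = lxml.html.fromstring(sample)
elements = root.xpath('//th/span')  # same XPath as in the answer above
far = float(elements[0].text)
pod = float(elements[1].text)
print(far, pod)  # 0.67 0.58
```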

I ended up solving it myself. I was using a strategy similar to isedev's, but I was hoping to find a better way of getting the 'Verified' data:

import urllib2
from bs4 import BeautifulSoup

def main():
    # Read the list of WFO identifiers, skipping blank lines
    wfo = [i.strip() for i in open(r'C:\Python27\wfo.txt') if i[:-1]]
    soup = BeautifulSoup(urllib2.urlopen('http://mesonet.agron.iastate.edu/cow/?syear=2009&smonth=9&sday=12&shour=12&eyear=2012&emonth=9&eday=12&ehour=12&wfo=ABQ&wtype%5B%5D=TO&hail=1.00&lsrbuffer=15&ltype%5B%5D=T&wind=58').read())
    elements = soup.find_all("span")
    find_verify = soup.find_all('th')

    far = float(elements[1].text)         # False Alarm Ratio
    pod = float(elements[2].text)         # Probability of Detection
    verified = find_verify[13].text[:-1]  # %verified, trailing '%' stripped

if __name__ == '__main__':
    main()
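One way to make the 'Verified' lookup less brittle than the hard-coded `find_verify[13]` index is to search by the label text and read the neighbouring cell. A sketch, where the label string and table layout are assumptions rather than taken from the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical layout: label in one <th>, value in the next
sample = """
<table>
  <tr><th>Percentage of Warnings Verified:</th><th>66%</th></tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")

def value_after_label(soup, label):
    # Find the <th> whose text contains the label, then read the next cell
    th = soup.find("th", string=lambda s: s and label in s)
    return th.find_next("th").get_text(strip=True)

verified = value_after_label(soup, "Verified").rstrip("%")
print(verified)  # 66
```

This survives rows being added or reordered, as long as the label text stays the same.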
