Scraping table data from a website

Question

I am trying to scrape table data from a website using BeautifulSoup4 and Python, and then create an Excel document with the results. So far, I have this:

import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://opl.tmhp.com/ProviderManager/SearchResults.aspx?TPI=&OfficeHrs=4&ProgType=STAR&UCCIndicator=No+Preference&Cnty=&NPI=&Srvs=6&Age=All&Gndr=B&SortBy=Distance&ZipCd=78552&SrvsOfrd=0&SpecCd=0&Name=&CntySrvd=0&Plan=H3&WvrProg=0&SubSpecCd=0&AcptPnt=Y&Rad=200&LangCd=99').read())

for row in soup('table', {'class' : 'spad'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string

But it doesn't display the data.

Any ideas?

Answer 1

First of all, the class is StandardResultsGrid, not spad.

Second, you don't need the tbody part. Simply use:

for row in soup('table', {'class' : 'StandardResultsGrid'})[0]('tr'):

Also note that, for some reason, the original page puts the header row inside tbody, so you'll have to skip the first row:

for row in soup('table', {'class' : 'StandardResultsGrid'})[0]('tr')[1:]:
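
Put together, the corrected loop might look like the sketch below. It is written in Python 3 syntax and runs against a small inline HTML string rather than the live page, so `SAMPLE_HTML`, its column contents, and the helper name `extract_rows` are assumptions made for illustration. Only the class name StandardResultsGrid and the header row inside tbody come from the page described above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<table class="StandardResultsGrid">
  <tbody>
    <tr><th>Name</th><th>Phone</th></tr>
    <tr><td>Provider A</td><td>555-0100</td></tr>
    <tr><td>Provider B</td><td>555-0101</td></tr>
  </tbody>
</table>
"""

def extract_rows(html):
    """Return (first cell, second cell) pairs for every non-header row."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling a tag or soup object is shorthand for find_all().
    table = soup("table", {"class": "StandardResultsGrid"})[0]
    rows = []
    for row in table("tr")[1:]:  # [1:] skips the header row
        tds = row("td")
        rows.append((tds[0].string, tds[1].string))
    return rows

rows = extract_rows(SAMPLE_HTML)
for first, second in rows:
    print(first, second)
```

Collecting the rows into a list, instead of printing inside the loop, leaves them available for a later export step.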

And note that some cells contain nested tables, so you'll have to parse the contents of the td elements carefully.
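
One possible way to handle the nested tables is to search only direct children with `recursive=False`, so the rows of an inner table are not mistaken for rows of the outer one, and to read each cell with `get_text()`, because `.string` returns None when a cell has more than one child. The sketch below uses Python 3 and a made-up `SAMPLE_HTML`. The real page's cell layout is not shown in this thread, so treat the structure as an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second cell of the data row holds a nested table.
SAMPLE_HTML = """
<table class="StandardResultsGrid">
  <tbody>
    <tr><td>Name</td><td>Details</td></tr>
    <tr>
      <td>Provider A</td>
      <td><table><tr><td>Street 1</td></tr><tr><td>City X</td></tr></table></td>
    </tr>
  </tbody>
</table>
"""

def extract_rows(html):
    """Return one list of cell texts per outer data row, flattening nested tables."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "StandardResultsGrid"})
    # Rows may sit directly under <table> or under an explicit <tbody>.
    body = table.find("tbody", recursive=False) or table
    rows = []
    # recursive=False keeps rows of nested tables out of the result;
    # [1:] skips the header row.
    for row in body.find_all("tr", recursive=False)[1:]:
        cells = row.find_all("td", recursive=False)
        # get_text joins all text fragments inside the cell with a space.
        rows.append([cell.get_text(" ", strip=True) for cell in cells])
    return rows

print(extract_rows(SAMPLE_HTML))
```

Without `recursive=False`, a plain `table('tr')` call would also return the two inner rows, and indexing `tds[1]` on them would fail because they have only one cell each.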


Answer 1
5 2013-05-26 19:41:53
