Loop URL to scrape using beautiful soup python

I am using the following code to scrape a website. It works fine for a single page on the site. Now I want to scrape several such pages, for which I am looping the URL as shown below.

from bs4 import BeautifulSoup
import urllib2
import csv
import re
number = 2500
for i in xrange(2500,7000):
    page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    soup = BeautifulSoup(page.read())
    for eachuniversity in soup.findAll('fieldset',{'id':'ctl00_step2'}):
        print re.sub(r'\s+',' ',','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
        print '\n'
        number = number + 1

The following is the normal code, without the loop:

from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id=4591")
soup = BeautifulSoup(page.read())
for eachuniversity in soup.findAll('fieldset',{'id':'ctl00_step2'}):
    print re.sub(r'\s+',' ',''.join(eachuniversity.findAll(text=True)).encode('utf-8'))

I am looping the id value in the URL from 2500 to 7000, but there are many ids for which no value exists, so there are no such pages. How do I skip those pages and scrape data only when data exists for a given id?

You can either try/except the result (https://stackoverflow.com/questions/6092992/why-is-it-easier-to-ask-forgiveness-than-permission-in-python-but-not-in-java):

for i in xrange(2500,7000):
    try:
        page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    except urllib2.HTTPError:
        # the server answered with an error status (e.g. 404 for a missing id): skip it
        continue
    else:
        soup = BeautifulSoup(page.read())
        for eachuniversity in soup.findAll('fieldset',{'id':'ctl00_step2'}):
            print re.sub(r'\s+',' ',','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
            print '\n'
            number = number + 1

or use a (great) lib such as requests and check the response before scraping:

from bs4 import BeautifulSoup
import requests
import re

for i in xrange(2500,7000):
    page = requests.get("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    if not page.ok:
        # error status (e.g. 404): no page exists for this id
        continue
    soup = BeautifulSoup(page.text)
    for eachuniversity in soup.findAll('fieldset',{'id':'ctl00_step2'}):
        print re.sub(r'\s+',' ',','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
        print '\n'
        number = number + 1
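Note that page.ok is just shorthand for page.status_code < 400, so this skips any id that comes back as a client or server error and only parses pages the server actually returned.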

Basically there's no way for you to know whether the page with that id exists before calling the URL. Try to find an index page on the site; otherwise you simply can't tell until you try to access the URL.
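That said, if you only want to probe whether an id exists without downloading the whole page body, a HEAD request is a cheaper check. Below is a minimal sketch using requests; it assumes the server handles HEAD requests properly and returns 404 for missing ids, which is not guaranteed for every site:

import requests

def id_exists(i):
    # a HEAD request returns only the response headers, never the page body
    resp = requests.head("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    # ok is True for any status code below 400
    return resp.ok

# collect only the ids that actually resolve to a page
valid_ids = [i for i in xrange(2500, 7000) if id_exists(i)]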
