KeyError and TypeError in my Python web scraper

Question

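Sorry for the vague title, but I could not find a better way to summarize my problem in one sentence.

I am trying to scrape student and score information from a French website. The link is this one ( http://www.bankexam.fr/resultat/2014/BACCALAUREAT/AMIENS?filiere=BACS )

My code is as follows: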

import time
import urllib2
from bs4 import BeautifulSoup
regions = {'R\xc3\xa9sultats Bac Amiens 2014':'/resultat/2014/BACCALAUREAT/AMIENS'}
base_url = 'http://www.bankexam.fr'
tests = {'es':'?filiere=BACES','s':'?filiere=BACS','l':'?filiere=BACL'}
for i in regions:
    for x in tests:
        # create the output file
        output_file = open('/Users/student project/'+ i + '_' + x + '.txt','a')
        time.sleep(2) #compassionate scraping
        section_url = base_url + regions[i] + tests[x]  #now goes to the x test page of region i 
        request = urllib2.Request(section_url)
        response = urllib2.urlopen(request)
        soup = BeautifulSoup(response,'html.parser')
        content = soup.find('div',id='zone_res')
        for row in content.find_all('tr'):
            if row.td:
                student = row.find_all('td')
                name = student[0].strong.string.encode('utf8').strip()
                try:
                    school = student[1].strong.string.encode('utf8')
                except AttributeError:
                    school = 'NA'
                result = student[2].span.string.encode('utf8')
                output_file.write ('%s|%s|%s\n' % (name,school,result))
        # Find the maximum pages to go through
        if soup.find('div','pagination'): 
            import re
            page_info = soup.find('div','pagination')
            pages = []
            for i in page_info.find_all('a',re.compile('elt')):
                try:
                    pages.append(int(i.string.encode('utf8')))
                except ValueError:
                    continue
            max_page = max(pages)
            # Now goes through page 2 to max page
            for i in range(1,max_page):
                page_url = '&p='+str(i)+'#anchor'
                section2_url = section_url+page_url
                request = urllib2.Request(section2_url)
                response = urllib2.urlopen(request)
                soup = BeautifulSoup(response,'html.parser')
                content = soup.find('div',id='zone_res')
                for row in content.find_all('tr'):
                    if row.td:
                        student = row.find_all('td')
                        name = student[0].strong.string.encode('utf8').strip()
                        try:
                            school = student[1].strong.string.encode('utf8')
                        except AttributeError:
                            school = 'NA'
                        result = student[2].span.string.encode('utf8')
                        output_file.write ('%s|%s|%s\n' % (name,school,result))

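More about the code: I created the "regions" dictionary and the "tests" dictionary because I still need to collect about 30 other regions; only one is included here for demonstration. I am only interested in the results of three tests (ES, S, L), which is why I created the "tests" dictionary.

Two errors keep showing up. One is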

KeyError: 2

which points to line 12:

section_url = base_url + regions[i] + tests[x]

The other is

TypeError: cannot concatenate 'str' and 'int' objects

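which points to line 10.

I know there is a lot of information here, and I may not have listed what matters most for helping me. But please let me know how I can fix this! Thanks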

Answer 1

The problem is that you are using the variable i in multiple places.

Near the top of the file, you do:

for i in regions:

So in some places, i is expected to be a key into the regions dictionary.

The trouble comes when you use it again later. You do so in two places:

for i in page_info.find_all('a',re.compile('elt')):

and:

for i in range(1,max_page):

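The second one is what causes your exceptions. A Python for loop does not create a new scope, so once that inner loop has run, i still holds an integer the next time the "for x in tests" loop comes around. That integer is not a key in the regions dict (hence the KeyError), and it cannot be concatenated with a string either (hence the TypeError).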
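To see the effect in isolation, here is a minimal sketch with no network access. The shortened region key, the 'out_' filename prefix, and the problems list are made up for illustration, and range(1, 3) just stands in for your pagination loop:

```python
regions = {'Amiens': '/resultat/2014/BACCALAUREAT/AMIENS'}
tests = {'es': '?filiere=BACES', 's': '?filiere=BACS'}

problems = []  # hypothetical list, only used to record what goes wrong
for i in regions:
    for x in tests:
        try:
            # same shape as the open(...) call on line 10
            filename = 'out_' + i + '_' + x + '.txt'
        except TypeError:
            problems.append(('TypeError', i))
        try:
            # same shape as line 12
            section_url = regions[i] + tests[x]
        except KeyError:
            problems.append(('KeyError', i))
        # stand-in for the pagination loop: it rebinds i to an integer
        for i in range(1, 3):
            pass

print(problems)
```

The first pass through the tests loop works because i is still the region name. The second pass fails on both lines because i was left as 2 by the inner loop.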

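I suggest renaming some or all of these variables. Give them meaningful names if possible (i is perhaps acceptable for an "index" variable, but I would avoid using it for anything else unless you are code golfing).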
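As a rough sketch of what the renaming could look like, here is the loop structure with a distinct name for each loop variable. The names region_name, test_name, page_number and the helper build_page_urls are my own, not from your code, and max_page is passed in as a fixed number here, whereas your script parses it from the pagination div. The fetching and parsing are left out so that only the variable handling is shown:

```python
def build_page_urls(base_url, regions, tests, max_page):
    """Return (region_name, test_name, url) tuples for every page to fetch."""
    urls = []
    for region_name in regions:
        for test_name in tests:
            section_url = base_url + regions[region_name] + tests[test_name]
            urls.append((region_name, test_name, section_url))
            # same range and URL suffix as the original pagination loop
            for page_number in range(1, max_page):
                page_url = section_url + '&p=' + str(page_number) + '#anchor'
                urls.append((region_name, test_name, page_url))
    return urls


regions = {'Amiens': '/resultat/2014/BACCALAUREAT/AMIENS'}
tests = {'es': '?filiere=BACES', 's': '?filiere=BACS'}
page_urls = build_page_urls('http://www.bankexam.fr', regions, tests, max_page=3)
for region_name, test_name, url in page_urls:
    print(region_name, test_name, url)
```

Because no loop reuses another loop's variable, region_name is still a valid dictionary key (and still a string for building the output filename) on every pass.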

Solution 1
1 Accepted 2015-04-09 00:06:27

我的python网络抓取工具中的KeyError和TypeError

问题描述

1 个解决方案

解决方案1 1 已采纳 2015-04-09 00:06:27

解决方案1
1 已采纳 2015-04-09 00:06:27