
Random “IndexError: list index out of range”

I am trying to scrape a site that returns its data via JavaScript. The code I wrote using BeautifulSoup works pretty well, but at random points during scraping I get the following error:

Traceback (most recent call last):
File "scraper.py", line 48, in <module>
accessible = accessible[0].contents[0]
IndexError: list index out of range

Sometimes I can scrape 4 urls, sometimes 15, but at some point the script eventually fails and gives me the above error. I can find no pattern behind the failures, so I'm really at a loss here - what am I doing wrong?

from bs4 import BeautifulSoup
import urllib
import urllib2
import jabba_webkit as jw
import csv
import string
import re
import time

countries = csv.reader(open("countries.csv", 'rb'), delimiter=",")
database = csv.writer(open("herdict_database.csv", 'w'), delimiter=',')

basepage = "https://www.herdict.org/explore/"
session_id = "indepth;jsessionid=C1D2073B637EBAE4DE36185564156382"
ccode = "#fc=IN"
end_date = "&fed=12/31/"
start_date = "&fsd=01/01/"

year_range = range(2009, 2011)
years = [str(year) for year in year_range]

def get_number(var):
    number = re.findall(r"(\d+)", var)

    if len(number) > 1:
        thing = number[0] + number[1]
    else:
        thing = number[0]

    return thing

def create_link(basepage, session_id, ccode, end_date, start_date, year):
    link = basepage + session_id + ccode + end_date + year + start_date + year
    return link



for ccode, name in countries:
    for year in years:
        link = create_link(basepage, session_id, ccode, end_date, start_date, year)
        print link
        html = jw.get_page(link)
        soup = BeautifulSoup(html, "lxml")

        accessible = soup.find_all("em", class_="accessible")
        inaccessible = soup.find_all("em", class_="inaccessible")

        accessible = accessible[0].contents[0]
        inaccessible = inaccessible[0].contents[0]

        acc_num = get_number(accessible)
        inacc_num = get_number(inaccessible)

        print acc_num
        print inacc_num
        database.writerow([name]+[year]+[acc_num]+[inacc_num])

        time.sleep(2)

You need to add error handling to your code. When scraping many websites, some pages will be malformed or otherwise broken; when that happens, you end up operating on empty results.

Go through the code, find every place where you assume an operation succeeded, and guard against errors there.
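As a minimal illustration of that guarding (the helper name `first_text` and the `default` parameter are my own, not part of the original code), you can wrap the risky `[0]` index in a function that handles the empty-list case instead of assuming a match was found:

```python
def first_text(elements, default=None):
    """Return the first child of the first element in `elements`,
    or `default` when find_all() came back empty."""
    if not elements:          # find_all() returned [] -- nothing matched
        return default
    return elements[0].contents[0]
```

With this, `accessible = first_text(soup.find_all("em", class_="accessible"))` can never raise an IndexError; you then only have to check for `None` once.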

For that specific case, I would do this:

if not inaccessible or not accessible:
    # malformed page
    continue

soup.find_all("em", class_="accessible") is probably returning an empty list. You can try:

if accessible:
    accessible = accessible[0].contents[0]

or more generally:

if accessible and inaccessible:
    accessible = accessible[0].contents[0]
    inaccessible = inaccessible[0].contents[0]
else:
    print 'Something went wrong!'
    continue
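Note that the same unchecked assumption hides in `get_number`: `re.findall` returns an empty list when the string contains no digits, so `number[0]` would raise the same IndexError. A defensive rewrite might look like this (returning `None` on no match is my own choice, not from the original code):

```python
import re

def get_number(var):
    """Concatenate up to the first two digit groups found in `var`,
    returning None instead of raising when there are no digits."""
    groups = re.findall(r"\d+", var)
    if not groups:  # no digits at all -- avoid groups[0] blowing up
        return None
    return "".join(groups[:2])
```

This preserves the original behavior (one group is returned as-is, two or more are concatenated) while making the no-digits case explicit.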
