
Python - Index Error - list index out of range

I am parsing data from a website but getting the error "IndexError: list index out of range". At the time of debugging, all the values were populated. It worked completely fine previously, and I can't understand why I am suddenly getting this error.

str2 = cols[1].text.strip()

IndexError: list index out of range

Here is my code.

import requests
import DivisionModel
from bs4 import BeautifulSoup
from time import sleep


class DivisionParser:

    def __init__(self, zoneName, zoneUrl):
        self.zoneName = zoneName
        self.zoneUrl = zoneUrl

    def getDivision(self):

        response = requests.get(self.zoneUrl)
        soup = BeautifulSoup(response.content, 'html5lib')
        table = soup.findAll('table', id='mytable')
        rows = table[0].findAll('tr')

        division = []
        for row in rows:
            if row.text.find('T No.') == -1:
                cols = row.findAll('td')

                str1 = cols[0].text.strip()
                str2 = cols[1].text.strip()
                str3 = cols[2].text.strip()
                strurl = cols[2].findAll('a')[0].get('href')
                str4 = cols[3].text.strip()
                str5 = cols[4].text.strip()
                str6 = cols[5].text.strip()
                str7 = cols[6].text.strip()

                divisionModel = DivisionModel.DivisionModel(self.zoneName, str2, str3, strurl, str4, str5, str6, str7)
                division.append(divisionModel)
        return division


These are the values at the time of debugging:

str1 = {str} '1'
str2 = {str} 'BHUSAWAL DIVN-ENGINEERING'
str3 = {str} 'DRMWBSL692019t1'
str4 = {str} 'Bhusawal Division - TRR/P- 44.898Tkms & 2.225Tkms on 9 Bridges total 47.123Tkms on ADEN MMR &'
str5 = {str} 'Open'
str6 = {str} '23/12/2019 15:00'
str7 = {str} '5'
strurl = {str} '/works/pdfdocs/122019/51822293/viewNitPdf_3021149.pdf'

As a general rule, whatever comes from the cold and hostile outside world is totally unreliable. Here:

    response = requests.get(self.zoneUrl)
    soup = BeautifulSoup(response.content, 'html5lib')

you seem to suffer from the terrible delusion that the response will always be what you expect. Hint: it won't. It is guaranteed that sometimes the response will be something different - it could be that the site is down, or that it decided to blacklist your IP because they don't like having you scrape their data, or whatever.

IOW, you really want to check the response's status code AND the response content. Actually, you want to be prepared for just about anything - FWIW, since you don't specify a timeout, your code could just stay frozen forever waiting for a response.

so actually what you want here is along the lines of:

try:
    response = requests.get(yoururl, timeout=some_appropriate_value)
    # cf requests doc
    response.raise_for_status()
# cf requests doc
except requests.exceptions.RequestException as e:
    # nothing else you can do here - depending on
    # the context (script? library code?),
    # you either want to re-raise the exception,
    # raise your own exception, or just
    # show the error message and exit.
    # Only you can decide on the appropriate course.
    print("couldn't fetch {}: {}".format(yoururl, e))
    return

if not response.headers['content-type'].startswith("text/html"):
    # idem - not what you expected, and you can't do much
    # except mentioning the fact to the caller one way
    # or another. Here I just print the error and return,
    # but if this is library code you want to raise an
    # exception instead
    print("{} returned non text/html content {}".format(yoururl, response.headers['content-type']))
    print("response content:\n\n{}\n".format(response.text))
    return

# etc...

requests has rather exhaustive docs; I suggest you read more than the quickstart to learn to use it properly. And that's only half the job - even if you do get a 200 response with no redirections and the right content type, it doesn't mean the markup is what you expect, so here again you have to double-check what you get from BeautifulSoup - for example here:

table = soup.findAll('table', id='mytable')
rows = table[0].findAll('tr')

There's absolutely no guarantee that the markup contains any table with a matching id (nor any table at all, FWIW), so you have to either check beforehand or handle the exception:

tables = soup.findAll('table', id='mytable')
if not tables:
    # oops, no matching tables?
    print("no table 'mytable' found in markup")
    print("markup:\n{}\n".format(response.text))
    return
rows = tables[0].findAll('tr')
# idem, the table might be empty, etc etc
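To check beforehand instead of catching the exception, a per-row guard on the number of cells works too. A minimal sketch, assuming a valid data row always has seven `td` cells (plain strings stand in for the cell objects, and `parse_row` is a hypothetical helper, not part of the original code):

```python
EXPECTED_COLS = 7  # assumption: a valid data row has seven <td> cells


def parse_row(cells):
    """Return the stripped cell texts, or None for placeholder rows
    such as a single-cell "No Result" row."""
    if len(cells) < EXPECTED_COLS:
        return None
    return [c.strip() for c in cells]


print(parse_row(['No Result']))  # None - skip this row
print(parse_row([' a ', 'b', 'c', 'd', 'e', 'f', 'g ']))
```

Returning `None` for short rows lets the caller `continue` past them without ever touching an out-of-range index.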

One of the fun things with programming is that handling the nominal case is often rather straightforward - but then you have to handle all the possible corner cases, and this usually requires as much or more code than the nominal case ;-)

While parsing the rows (checking for "T No." and reading the values from the td cells), it turns out the website developer put "No Result" in some rows. Those rows have fewer td cells, so at run time the loop can't get all the values and throws "list index out of range".
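In other words, indexing past the end of a short placeholder row is exactly what raises the error. A minimal sketch with plain lists standing in for the stripped cell texts:

```python
# A normal data row yields seven cell values ...
cols = ['1', 'BHUSAWAL DIVN-ENGINEERING', 'DRMWBSL692019t1',
        'Bhusawal Division - TRR/P- ...', 'Open', '23/12/2019 15:00', '5']
print(cols[1])  # fine: 'BHUSAWAL DIVN-ENGINEERING'

# ... but a placeholder row yields a single cell, so cols[1] is out of range
cols = ['No Result']
try:
    print(cols[1])
except IndexError as e:
    print(e)  # list index out of range
```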

Well thanks to all for the help.

class DivisionParser:

    def __init__(self, zoneName, zoneUrl):
        self.zoneName = zoneName
        self.zoneUrl = zoneUrl

    def getDivision(self):
        response = requests.get(self.zoneUrl)
        soup = BeautifulSoup(response.content, 'html5lib')
        table = soup.findAll('table', id='mytable')
        if not table:
            # no matching table in the markup - nothing to parse
            return []
        rows = table[0].findAll('tr')

        division = []
        for row in rows:
            if row.text.find('T No.') == -1:
                try:
                    cols = row.findAll('td')

                    str1 = cols[0].text.strip()
                    str2 = cols[1].text.strip()
                    str3 = cols[2].text.strip()
                    strurl = cols[2].findAll('a')[0].get('href')
                    str4 = cols[3].text.strip()
                    str5 = cols[4].text.strip()
                    str6 = cols[5].text.strip()
                    str7 = cols[6].text.strip()
                    divisionModel = DivisionModel.DivisionModel(self.zoneName, str2, str3, strurl,
                                                                str4, str5, str6, str7)
                    division.append(divisionModel)
                except IndexError:
                    # placeholder rows such as "No Result" have fewer cells
                    print("No Result")
        return division
