IndexError: string index out of range [python, scraping]

I am trying to scrape a website but only want to write specific rows to my final csv file. When I try to specify the rows I get

IndexError: string index out of range.

I do not get this error when I run this code:

rows = [
["The Conservation Fund",2014,"","","","Program Services: ","$174,530,077"],
["The Conservation Fund",2014,"","","","Administration: ","$2,810,944"],
["The Conservation Fund",2014,"","","","Fundraising: ","$2,144,456"],
["The Conservation Fund",2013,"$480,674","$55,266","$0","LAWRENCE A SELZER","PRESIDENT & CEO"],
["The Conservation Fund",2013,"$369,848","$54,856","$0","RICHARD L ERDMANN","EXECUTIVE VICE PRESIDENT"],
["The Conservation Fund",2013,"$312,232","$44,386","$0","DAVID K PHILLIPS JR","EXECUTIVE VP AND CFO"],
["The Conservation Fund",2013,"$251,615","$16,125","$0","DEAN H CANNON","SENIOR VP/GENERAL COUNSEL"]]

rows1 = [x for x in rows if x[6][0] != '$']
print(rows1)

I get exactly what I expect which is:

[['The Conservation Fund', 2013, '$480,674', '$55,266', '$0', 'LAWRENCE A SELZER', 'PRESIDENT & CEO'], ['The Conservation Fund', 2013, '$369,848', '$54,856', '$0', 'RICHARD L ERDMANN', 'EXECUTIVE VICE PRESIDENT'], ['The Conservation Fund', 2013, '$312,232', '$44,386', '$0', 'DAVID K PHILLIPS JR', 'EXECUTIVE VP AND CFO'], ['The Conservation Fund', 2013, '$251,615', '$16,125', '$0', 'DEAN H CANNON', 'SENIOR VP/GENERAL COUNSEL']]

Now when I try to run this similar list comprehension from my scraper (I will paste some of the code here, because I legally cannot post the whole thing):

for page in eins:
    rows =[]
    driver.get(page)
    print("Getting {}".format(page))
    soup = BeautifulSoup(driver.page_source, "lxml")
    name = soup.find("h1", {"class" : "centered"})
    print(name.text)
    members = soup.findAll("g", { "transform" : "translate(0,0)"})
    time = soup.find("option", {"selected" : "selected"}).text
    time = int(time)
    for year in members[2:]:
        column = year.find_all("g")
        for thing in column:
            row_info = [name.text, time]
            entries = thing.find_all("text")
            if len(entries) != 5:
                row_info.extend((5 - len(entries)) * [""])
            for entry in entries:
                row_info.append(entry.text)
            rows.append(row_info)
        time = time - 1
        rows1 = [x for x in rows if x[6][0] != "$"]

Now suddenly I get the following error:

Traceback (most recent call last):
  File "Board_members.py", line 53, in <module>
    rows1 = [x for x in rows if x[6][0] != "$"]
  File "Board_members.py", line 53, in <listcomp>
    rows1 = [x for x in rows if x[6][0] != "$"]
IndexError: string index out of range

Is the rows list not formatted the same way in both instances? What am I doing wrong here? I tried a for loop with a continue statement earlier, and simple if statements, but everything comes down to the same error.

I am still a beginner so please forgive my flimsy code. I looked around here for answers to the question, but if they were there I just could not understand them. Thank you so much!

edit: just for context, the rows in the first instance come from a csv file that I managed to create using the scraper, and it looks like this in the csv:

organization,year,compensation,other,related,name,position
The Conservation Fund,2015,,,,Total Revenue: ,"$215,096,466"
The Conservation Fund,2015,,,,Contributions: ,"$114,351,967"
The Conservation Fund,2015,,,,Gov't Grants: ,"$9,723,802"
The Conservation Fund,2015,,,,Program Services: ,"$90,762,036"
The Conservation Fund,2015,,,,Investments: ,"$220,002"
The Conservation Fund,2015,,,,Special Events: ,$0
The Conservation Fund,2015,,,,Sales: ,$0
The Conservation Fund,2015,,,,Other: ,"$38,659"
The Conservation Fund,2014,,,,Total Expenses: ,"$179,485,477"
The Conservation Fund,2014,,,,Program Services: ,"$174,530,077"
The Conservation Fund,2014,,,,Administration: ,"$2,810,944"
The Conservation Fund,2014,,,,Fundraising: ,"$2,144,456"
The Conservation Fund,2013,"$480,674","$55,266",$0,LAWRENCE A SELZER,PRESIDENT & CEO
The Conservation Fund,2013,"$369,848","$54,856",$0,RICHARD L ERDMANN,EXECUTIVE VICE PRESIDENT
The Conservation Fund,2013,"$312,232","$44,386",$0,DAVID K PHILLIPS JR,EXECUTIVE VP AND CFO

edit 2: and this is the output I get from printing rows before rows1:

[['The Conservation Fund', 2015, '', '', '', 'Total Revenue: ', '$215,096,466'], ['The Conservation Fund', 2015, '', '', '', 'Contributions: ', '$114,351,967'], ['The Conservation Fund', 2015, '', '', '', "Gov't Grants: ", '$9,723,802'], ['The Conservation Fund', 2015, '', '', '', 'Program Services: ', '$90,762,036'], ['The Conservation Fund', 2015, '', '', '', 'Investments: ', '$220,002'], ['The Conservation Fund', 2015, '', '', '', 'Special Events: ', '$0'], ['The Conservation Fund', 2015, '', '', '', 'Sales: ', '$0'], ['The Conservation Fund', 2015, '', '', '', 'Other: ', '$38,659'], ['The Conservation Fund', 2014, '', '', '', 'Total Expenses: ', '$179,485,477'], ['The Conservation Fund', 2014, '', '', '', 'Program Services: ', '$174,530,077']]

The error you are getting is

IndexError: string index out of range

which means you are trying to access a string index that does not exist.

The example below shows what can cause IndexError: string index out of range:

test = 'abc'
test[2]   # 'c'
test[3]   # raises IndexError: string index out of range

test1 = ''
test1[0]  # raises IndexError: string index out of range
test1[1]  # raises IndexError: string index out of range

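In your case, in the statement rows1 = [x for x in rows if x[6][0] != "$"], x[6] is an empty string for at least one row, so x[6][0] tries to fetch index 0 of an empty string. This can happen, for example, when thing.find_all("text") returns no entries, because the padding then fills every remaining field with "", or when the last text element is itself empty.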
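To confirm this, it may help to list the rows whose seventh field is empty before filtering. The following is a minimal sketch; the sample rows are made up for illustration, and the third one is a hypothetical row of the kind the scraper would produce when no text entries are found.

```python
rows = [
    ["The Conservation Fund", 2014, "", "", "", "Fundraising: ", "$2,144,456"],
    ["The Conservation Fund", 2013, "$480,674", "$55,266", "$0",
     "LAWRENCE A SELZER", "PRESIDENT & CEO"],
    # Hypothetical row: no text entries were found, so every field was padded with "".
    ["The Conservation Fund", 2013, "", "", "", "", ""],
]

# Collect the positions of rows whose seventh field is an empty string.
empty_rows = [i for i, row in enumerate(rows) if row[6] == ""]
print(empty_rows)  # [2]
```

Any index printed here points at a row that would make x[6][0] fail.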

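The code below should fix the error, because it first checks that x is not empty, then that x[6] is not empty, and only then looks at x[6][0]. Note that rows whose x[6] is an empty string are dropped from the result.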

rows1 = [x for x in rows if x and x[6] and x[6][0] != "$"]
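As an alternative, str.startswith avoids indexing altogether, since it simply returns False for an empty string. The sketch below is not from the original answer; the sample rows are made up for illustration, and the variable names keep_empty and drop_empty are hypothetical.

```python
rows = [
    ["The Conservation Fund", 2014, "", "", "", "Fundraising: ", "$2,144,456"],
    ["The Conservation Fund", 2013, "$480,674", "$55,266", "$0",
     "LAWRENCE A SELZER", "PRESIDENT & CEO"],
    # Hypothetical row whose seventh field is empty.
    ["The Conservation Fund", 2013, "", "", "", "", ""],
]

# startswith() returns False for "", so no IndexError is raised.
# Rows with an empty seventh field are kept.
keep_empty = [x for x in rows if not x[6].startswith("$")]

# Add a truthiness check to also drop rows whose seventh field is empty.
drop_empty = [x for x in rows if x[6] and not x[6].startswith("$")]

print(len(keep_empty))  # 2
print(len(drop_empty))  # 1
```

Which variant fits depends on whether rows with an empty last field should end up in the csv.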
