
IndexError: list index out of range when creating a list with variable as number, but works fine in print, why?

Python shows this error when I append an element to a list, even though printing the same element works. I am scraping a list of college names and sites. I use a regex to pick out the sites and append them to the college_site list, but I get IndexError: list index out of range, even though the loop starts at the beginning and stops at the end. What should I change?

My code is:

import requests
from bs4 import BeautifulSoup
import json
import re


URL = 'http://doors.stanford.edu/~sr/universities.html'

headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}

college_site = []


def college():
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    site = "\w+\.+\w+\)"

    for ol in soup.find_all('ol'):
        for num in range(len((ol.get_text()))):
            line = ol.get_text().split()
            if (re.search(site, line[num])):
                college_site.append(line[num])
# works if i put: print(line[num])


    with open('E:\Python\mails for college\\test2\sites.json', 'w') as sites:
        json.dump(college_site, sites)


if __name__ == '__main__':
    college()

To get the list of colleges and links, you can use this example:

import requests
from bs4 import BeautifulSoup
import json


URL = 'http://doors.stanford.edu/~sr/universities.html'

headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}

college_sites = []

def college():
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')

    for li in soup.select('ol li'):
        college_name = li.a.get_text(strip=True)
        college_link = li.a.find_next_sibling(text=True).strip()
        print(college_name, college_link)

        college_sites.append((college_name, college_link))

    with open('data.json', 'w') as sites:
        json.dump(college_sites, sites, indent=4)


if __name__ == '__main__':
    college()

Prints:

Abilene Christian University (acu.edu)
Adelphi University (adelphi.edu)
Agnes Scott College (scottlan.edu)
Air Force Institute of Technology (afit.af.mil)
Alabama A&M University (aamu.edu)
Alabama State University (alasu.edu)
Alaska Pacific University 
Albertson College of Idaho (acofi.edu)
Albion College (albion.edu)
Alderson-Broaddus College 
Alfred University (alfred.edu)
Allegheny College (alleg.edu)

...

And saves data.json:

[
    [
        "Abilene Christian University",
        "(acu.edu)"
    ],
    [
        "Adelphi University",
        "(adelphi.edu)"
    ],
    [
        "Agnes Scott College",
        "(scottlan.edu)"
    ],

...
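
As the output above shows, some entries (for example Alaska Pacific University) have no site, and the saved links keep their parentheses. Here is a minimal sketch of a clean-up step, assuming the pairs look like the ones saved above. The helper name `clean_sites` is hypothetical and not part of the original answer.

```python
def clean_sites(pairs):
    """Strip the surrounding parentheses and drop entries with no site."""
    cleaned = []
    for name, link in pairs:
        site = link.strip().strip("()")
        if site:  # skip colleges listed without a site
            cleaned.append((name, site))
    return cleaned


# Sample rows copied from the printed output above.
sample = [
    ("Abilene Christian University", "(acu.edu)"),
    ("Alaska Pacific University", ""),
    ("Alabama A&M University", "(aamu.edu)"),
]
print(clean_sites(sample))
```

Whether to drop or keep the entries without a site depends on what the JSON file is used for; this sketch drops them.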

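The problem is in this part: for num in range(len((ol.get_text()))). You want to loop over the items in line, but range(len(ol.get_text())) counts every character of the text, so num soon runs past the last valid index of line. The fix is simple.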

Change:

        for num in range(len((ol.get_text()))):
            line = ol.get_text().split()

to:

        line = ol.get_text().split()
        for num in range(len(line)):
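
To see why the original loop fails, it may help to compare the two lengths on a small sample. The text below is made up for illustration and stands in for `ol.get_text()`. The helper `first_bad_index` is hypothetical and exists only for this demonstration.

```python
# Stand-in for ol.get_text(); the real page is much longer.
text = "Abilene Christian University (acu.edu)\nAdelphi University (adelphi.edu)\n"
line = text.split()

n_chars = len(text)   # what the original loop iterated over
n_tokens = len(line)  # the valid index range for line

print(n_chars, n_tokens)


def first_bad_index(tokens, count):
    """Return the first index in range(count) that raises IndexError, else None."""
    for num in range(count):
        try:
            tokens[num]
        except IndexError:
            return num
    return None


print(first_bad_index(line, n_chars))   # index where the original loop breaks
print(first_bad_index(line, n_tokens))  # None: the fixed loop stays in range
```

This presumably also explains why print seemed to work: every valid item is printed before the index runs past the end of line, and only then is the IndexError raised.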

Full example:

import requests
from bs4 import BeautifulSoup
import json
import re


URL = 'http://doors.stanford.edu/~sr/universities.html'

headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}

college_site = []


def college():
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    site = r"\w+\.+\w+\)"  # raw string, so the backslashes reach the regex engine unchanged

    for ol in soup.find_all('ol'):
        line = ol.get_text().split()
        for num in range(len(line)):
            if (re.search(site, line[num])):
                college_site.append(line[num])


    with open(r'E:\Python\mails for college\test2\sites.json', 'w') as sites:  # raw string keeps the Windows backslashes literal
        json.dump(college_site, sites)


if __name__ == '__main__':
    college()
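
As a variation on the fix, the index can be dropped entirely by iterating over the items themselves. This is a sketch that runs offline: `sample_text` stands in for `ol.get_text()`, and the function name `extract_sites` is hypothetical.

```python
import re

# Same pattern as in the question, written as a raw string and compiled once.
SITE = re.compile(r"\w+\.+\w+\)")


def extract_sites(text):
    """Return the whitespace-separated items that look like '(domain.tld)'."""
    return [token for token in text.split() if SITE.search(token)]


sample_text = """
Abilene Christian University (acu.edu)
Alaska Pacific University
Air Force Institute of Technology (afit.af.mil)
"""
print(extract_sites(sample_text))
```

With no index involved there is nothing to go out of range, which is why iterating directly is usually preferred over range(len(...)) in Python.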

