IndexError: list index out of range - how to skip broken URLs?

Question

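How do I tell my program to skip broken or non-existent URLs and carry on with the task? Every time I run it, it stops as soon as it reaches a URL that does not exist and raises: IndexError: list index out of range.

The URLs are numbered from 1 to 450, but some of the pages in that range are broken (for example, URL 133 does not exist).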

import requests
import pandas as pd
import json
from pandas.io.json import json_normalize
from bs4 import BeautifulSoup

df = pd.DataFrame()

for id in range (1, 450):

      url = f"https://liiga.fi/api/v1/shotmap/2022/{id}"
      res = requests.get(url)
      soup = BeautifulSoup(res.content, "lxml")
      s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
      s = s.replace('null','"placeholder"')
      data = json.loads(s)
      data = json_normalize(data)
      matsit = pd.DataFrame(data)
      df = pd.concat([df, matsit], axis=0)


df.to_csv("matsit.csv", index=False)

Answer 1

I'll assume your IndexError comes from this line:

s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')

You can handle it like this:

try:
    s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
except IndexError as IE:
    print(f"Indexerror: {IE}")
    continue

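If the error does not come from that line, simply catch the exception on whichever line raises the IndexError. Alternatively, you can catch all exceptions: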


try:
    code_where_exception_occurs
except Exception as e:
    print(f"Exception: {e}")
    continue

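But I recommend being as specific as possible, so that each expected error is handled appropriately. In the example above, replace code_where_exception_occurs with your own code. You can also wrap the entire body of the for loop in a single try/except clause, although it is better to catch each exception separately. This should also work: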

try:
    url = f"https://liiga.fi/api/v1/shotmap/2022/{id}"
    res = requests.get(url)
    soup = BeautifulSoup(res.content, "lxml")
    s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
    s = s.replace('null','"placeholder"')
    data = json.loads(s)
    data = json_normalize(data)
    matsit = pd.DataFrame(data)
    df = pd.concat([df, matsit], axis=0)
except Exception as e:
    print(f"Exception: {e}")
    continue
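
To illustrate the advice about being specific, here is a minimal offline sketch with one except clause per expected failure. The decode_bodies function and the sample dictionary are hypothetical stand-ins for the text fetched from each URL; they are not part of the original code, and indexing an empty splitlines() result is used only to trigger the same kind of IndexError as indexing an empty select() result.

```python
import json


def decode_bodies(bodies):
    """Decode raw bodies keyed by id, skipping the unusable ones.

    `bodies` is a hypothetical stand-in for the text fetched per URL.
    """
    decoded = {}
    skipped = {}
    for key, body in bodies.items():
        try:
            # An empty body has no lines, so [0] raises IndexError
            first_line = body.splitlines()[0]
            payload = json.loads(first_line)
        except IndexError:
            # Nothing to parse: record the reason and move on
            skipped[key] = "empty body"
            continue
        except json.JSONDecodeError:
            # Body was present but is not valid JSON
            skipped[key] = "invalid JSON"
            continue
        decoded[key] = payload
    return decoded, skipped


# Made-up sample data: one good body, one empty, one malformed
sample = {1: '{"shots": 3}', 2: "", 3: "not json"}
decoded, skipped = decode_bodies(sample)
print(decoded)
print(skipped)
```

Handling each exception type in its own clause lets you tell a missing page apart from a malformed one, which a blanket except Exception would hide.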

Answer 2

The main issue is that some URLs return status 204 (No Content) instead of 200 (for example: https://liiga.fi/api/v1/shotmap/2022/405), so simply use an if statement to check for and handle this:

for i in range (400, 420):
    url = f"https://liiga.fi/api/v1/shotmap/2022/{i}"
    r=requests.get(url)
    
    if r.status_code != 200:
        print(f'Error occured: {r.status_code} on url: {url}')
        #### log or do what ever you like to do in case of error
    else:
        data.append(pd.json_normalize(r.json()))

Note: as already mentioned in https://stackoverflow.com/a/73584487/14460824, there is no need to use BeautifulSoup; use pandas directly to keep your code simple.

Example

import requests
import pandas as pd

data = []
for i in range (400, 420):
    url = f"https://liiga.fi/api/v1/shotmap/2022/{i}"
    r=requests.get(url)
    
    if r.status_code != 200:
        print(f'Error occured: {r.status_code} on url: {url}')
    else:
        data.append(pd.json_normalize(r.json()))

pd.concat(data, ignore_index=True)#.to_csv("matsit", index=False)

Output

Error occured: 204 on url: https://liiga.fi/api/v1/shotmap/2022/405
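
If you also want a record of which ids were skipped, the status check can be wrapped in a small helper. This is only a sketch: collect_json, FakeResponse and fake_fetch are hypothetical names introduced here so the loop can be run without network access, and the fake response exposes only the status_code attribute and json() method that the loop above relies on.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a response object (status_code and json())."""
    status_code: int
    payload: Any = None

    def json(self):
        return self.payload


def collect_json(ids: Iterable[int], fetch: Callable[[int], Any]):
    """Call fetch(i) for each id; keep JSON from 200 responses, record the rest."""
    rows = []
    skipped = {}
    for i in ids:
        r = fetch(i)
        if r.status_code != 200:
            # Remember the status code and continue with the next id
            skipped[i] = r.status_code
            continue
        rows.append(r.json())
    return rows, skipped


def fake_fetch(i):
    # Pretend id 405 answers 204, as in the output above; payloads are made up
    return FakeResponse(204) if i == 405 else FakeResponse(200, {"id": i})


rows, skipped = collect_json(range(404, 407), fake_fetch)
print(rows)
print(skipped)
```

With the real API you would pass something like lambda i: requests.get(f"https://liiga.fi/api/v1/shotmap/2022/{i}") as fetch, then feed each entry of rows to pd.json_normalize as in the example. Note that range(1, 450) stops at 449, so use range(1, 451) if URL 450 should be included.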


Answer 1
0 accepted 2022-09-08 07:43:30

Answer 2
0 2022-09-08 08:00:28
