Python webscraping beautiful soup list error
Hi, I am trying to download images from BGS borehole scans where there is more than one page, e.g. http://scans.bgs.ac.uk/sobi_scans/boreholes/795279/images/10306199.html and http://scans.bgs.ac.uk/sobi_scans/boreholes/18913699/images/18910430.html
I managed to download the first 2 pages of the first example, but when I reach the last page I get the error below. On that page the NextPage variable should be None, because the tag is not on the web page. At that point I want to continue to the next location (I have not added that yet, but I have an Excel list of URLs). The code is based on https://automatetheboringstuff.com/2e/chapter12/
Traceback (most recent call last):
  File "C:/Users/brentond/Documents/Python/Pdf BGS Scans.py", line 73, in
    NextPage = soup.select('a[title="Next page"]')[0]
IndexError: list index out of range
import pyautogui
import pyperclip
import webbrowser
import PyPDF2
import os
import openpyxl
import pdfkit
import requests
import bs4
# Define path of excel file
from requests import Response
path = r'C:\Users\brentond\Documents\TA2'
# Change directory to target location
os.chdir(path)
# Create workbook object
wb = openpyxl.load_workbook('BGS Boreholes.xlsm')
# Create worksheet object
ws = wb.get_sheet_by_name('Open')
# Assign URL to variable
StartURL = ws['A2'].value
URL = StartURL
NextURL = "NextURL"
# Assign BH ID to variable
Location = ws['B2'].value
while NextURL is not None:
    # Download URL
    res = requests.get(URL)  # type: Response
    res.raise_for_status()
    # Create beautiful soup object
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    # Find the URL of the borehole scan image.
    Scan = soup.select('#image_content img')
    # Check on HTML elements
    Address = soup.select('#image')
    AddressText = Address[0].get('src')
    print(AddressText)
    print()
    if Scan == []:
        print('Could not find scan image.')
    else:
        ScanUrl = Scan[0].get('src')
        # Download the image.
        print('Downloading image %s...' % (ScanUrl))
        res = requests.get(ScanUrl)
        res.raise_for_status()
        # Save the image to path
        PageNo = 0
        imageFile = open(os.path.join(path, Location) + "-Page" + str(PageNo) + ".png", 'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()
    # Find URL for next page
    PageNo = PageNo + 1
    NextPage = soup.select('a[title="Next page"]')[0]
    if NextPage == []:
        continue
    else:
        print(NextPage)
        NextURL = NextPage.get('href')
        URL = NextURL
        print(NextURL)
print('Done.')
If the element does not exist, you cannot select its first item. You can first verify that the element exists using find / find_all, or you can use try / except to catch the IndexError and change the script's behaviour in the error case.
So, since the anchor does not exist on the last page, soup.select('a[title="Next page"]') will always return an empty list there. Index zero therefore does not exist, and an IndexError is raised.
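You can see the empty-list behaviour with a few standalone lines (simplified markup standing in for the last scan page):

```python
import bs4

# A stand-in for the last page, which has no 'Next page' anchor.
soup = bs4.BeautifulSoup(
    '<a title="Previous page" href="/p1.html">prev</a>', 'html.parser')

matches = soup.select('a[title="Next page"]')
print(matches)  # [] -- select() found nothing
# matches[0] would now raise IndexError: list index out of range
```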
The easiest change is from
NextPage = soup.select('a[title="Next page"]')[0]
if NextPage == []:
    continue
else:
    print(NextPage)
    NextURL = NextPage.get('href')
to
NextPage = soup.select('a[title="Next page"]')
if not NextPage:
    continue
else:
    NextPage = NextPage[0]
    print(NextPage)
    NextURL = NextPage.get('href')
or
NextPage = soup.select('a[title="Next page"]')
if not NextPage:
    continue
else:
    print(NextPage[0])
    NextURL = NextPage[0].get('href')
depending on your personal preference.
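An equivalent approach uses find, which returns None (rather than an empty list) when nothing matches, so there is no index to go out of range. A sketch against simplified markup (the real loop would break, or move on to the next location from the Excel list, when it gets None):

```python
import bs4

def find_next_url(html):
    """find() returns None when the anchor is missing, so no IndexError."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    anchor = soup.find('a', attrs={'title': 'Next page'})
    return None if anchor is None else anchor.get('href')

print(find_next_url('<a title="Next page" href="/p3.html">next</a>'))  # /p3.html
print(find_next_url('<p>last page</p>'))  # None
```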