
How to scrape hyperlinks from multiple pages of a webpage with an unchanging URL, using Python/Beautiful Soup?

There is a paginated list of hyperlinks on this webpage: https://www.farmersforum.ie/mart-reports/county-Tipperary-mart/

The code I have created so far scrapes the relevant links from the first page. I cannot figure out how to extract the links from the subsequent pages (8 links per page, about 25 pages).

There does not seem to be a way to navigate the pages using the URL.


    from bs4 import BeautifulSoup
    import urllib.request
    
    # Scrape webpage
    parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
    resp = urllib.request.urlopen("https://www.farmersforum.ie/mart-reports/county-Tipperary-mart/")
    soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))
    
    # Extract links
    links = []
    for link in soup.find_all('a', href=True):
        links.append(link['href'])
    
    # Select relevant links, reformat, and drop duplicates    
    links = list(dict.fromkeys(["https://www.farmersforum.ie"+link for link in links if "/reports/Thurles" in link]))

Please advise how I can do this with Python.
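(If the pager is driven by a background XHR request rather than by separate URLs, the same data might be fetchable without a browser at all. The sketch below only illustrates that idea: the `page` query parameter is a hypothetical placeholder, and the real request would have to be confirmed in the browser's developer tools, Network tab.)

    import requests
    from bs4 import BeautifulSoup

    BASE = "https://www.farmersforum.ie"
    LIST_URL = BASE + "/mart-reports/county-Tipperary-mart/"

    links = []
    for page in range(1, 26):  # ~25 pages, per the question
        # "page" is an assumed parameter name; replace it with whatever
        # the site's pager actually sends (check the Network tab)
        resp = requests.get(LIST_URL, params={"page": page}, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            if "/reports/Thurles" in a["href"]:
                links.append(BASE + a["href"])

    # Drop duplicates while preserving order
    links = list(dict.fromkeys(links))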

I have solved this with Selenium. Thanks.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import time

# Launch Chrome driver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# Open webpage
driver.get("https://www.farmersforum.ie/mart-reports/county-Tipperary-mart/")

# Loop through pages
allLnks = []
iStop = False
# Continue until the next page button can no longer be found
while not iStop:
    for ii in range(2, 12):
        try:
            # Click page number ii in the pager
            driver.find_element(By.XPATH, '//*[@id="mainContent"]/div/div[1]/div[2]/ul/li[' + str(ii) + ']/a').click()
        except Exception:
            # No such page button (or the click failed): stop paging
            iStop = True
            break
        # Wait for the page to load
        time.sleep(0.1)
        # Identify all elements with tag name <a>
        lnks = driver.find_elements(By.TAG_NAME, "a")
        # Use get_attribute() to read each href (may be None for bare anchors)
        iiLnks = [lnk.get_attribute("href") for lnk in lnks]
        # Select relevant links and drop duplicates, preserving order
        iiLnks = list(dict.fromkeys([iiLnk for iiLnk in iiLnks if iiLnk and "/reports/Thurles" in iiLnk]))
        allLnks = allLnks + iiLnks
    if not iStop:
        # Click the "next" arrow (last item in the pager) to advance to the next block of pages
        driver.find_element(By.XPATH, '//*[@id="mainContent"]/div/div[1]/div[2]/ul/li[12]/a').click()
driver.quit()
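The fixed 0.1-second sleep is fragile on a slow connection. A more robust pattern is an explicit wait on the pager element before reading the links; a minimal sketch of that substitution (the XPath is the same one the answer already uses):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the pager link to be clickable, instead of sleeping
wait = WebDriverWait(driver, 10)
pager = wait.until(EC.element_to_be_clickable(
    (By.XPATH, '//*[@id="mainContent"]/div/div[1]/div[2]/ul/li[2]/a')))
pager.click()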
