
How to scrape hyperlinks from multiple pages of a webpage with an unchanging URL, using Python/Beautiful Soup?

There is a paginated list of hyperlinks on this webpage: https://www.farmersforum.ie/mart-reports/county-Tipperary-mart/

The code I have created so far scrapes the relevant links from the first page. I cannot figure out how to extract the links from the subsequent pages (8 links per page, about 25 pages).

There does not seem to be a way to navigate the pages using the URL.


    from bs4 import BeautifulSoup
    import urllib.request
    
    # Scrape webpage
    parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
    resp = urllib.request.urlopen("https://www.farmersforum.ie/mart-reports/county-Tipperary-mart/")
    soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))
    
    # Extract links
    links = []
    for link in soup.find_all('a', href=True):
        links.append(link['href'])
    
    # Select relevant links, reformat, and drop duplicates    
    links = list(dict.fromkeys(["https://www.farmersforum.ie"+link for link in links if "/reports/Thurles" in link]))

Please advise how I can do this with Python.
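(If the pager is driven by a background XHR request rather than by separate URLs, the same data might be fetchable without a browser at all. The sketch below only illustrates that idea: the `page` query parameter is a hypothetical placeholder, and the real request would have to be confirmed in the browser's developer tools, Network tab.)

    import requests
    from bs4 import BeautifulSoup

    BASE = "https://www.farmersforum.ie"
    LIST_URL = BASE + "/mart-reports/county-Tipperary-mart/"

    links = []
    for page in range(1, 26):  # ~25 pages, per the question
        # "page" is an assumed parameter name; replace it with whatever
        # the site's pager actually sends (check the Network tab)
        resp = requests.get(LIST_URL, params={"page": page}, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            if "/reports/Thurles" in a["href"]:
                links.append(BASE + a["href"])

    # Drop duplicates while preserving order
    links = list(dict.fromkeys(links))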

I have solved this with Selenium. Thanks.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import time

# Launch Chrome driver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# Open webpage
driver.get("https://www.farmersforum.ie/mart-reports/county-Tipperary-mart/")

# Loop through pages
allLnks = []
iStop = False
# Continue until the next page button can no longer be found
while not iStop:
    for ii in range(2, 12):
        try:
            # Click page number ii in the pager
            driver.find_element(By.XPATH, '//*[@id="mainContent"]/div/div[1]/div[2]/ul/li[' + str(ii) + ']/a').click()
        except Exception:
            # No such page button (or the click failed): stop paging
            iStop = True
            break
        # Wait for the page to load
        time.sleep(0.1)
        # Identify all elements with tag name <a>
        lnks = driver.find_elements(By.TAG_NAME, "a")
        # Use get_attribute() to read each href (may be None for bare anchors)
        iiLnks = [lnk.get_attribute("href") for lnk in lnks]
        # Select relevant links and drop duplicates, preserving order
        iiLnks = list(dict.fromkeys([iiLnk for iiLnk in iiLnks if iiLnk and "/reports/Thurles" in iiLnk]))
        allLnks = allLnks + iiLnks
    if not iStop:
        # Click the "next" arrow (last item in the pager) to advance to the next block of pages
        driver.find_element(By.XPATH, '//*[@id="mainContent"]/div/div[1]/div[2]/ul/li[12]/a').click()
driver.quit()
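The fixed 0.1-second sleep is fragile on a slow connection. A more robust pattern is an explicit wait on the pager element before reading the links; a minimal sketch of that substitution (the XPath is the same one the answer already uses):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the pager link to be clickable, instead of sleeping
wait = WebDriverWait(driver, 10)
pager = wait.until(EC.element_to_be_clickable(
    (By.XPATH, '//*[@id="mainContent"]/div/div[1]/div[2]/ul/li[2]/a')))
pager.click()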
