Scraping multiple pages in Python

I am trying to scrape a page that includes 12 links. I need to open each of these links and scrape all of their titles. When I open each link, I find multiple pages behind it. However, my code only scrapes the first page of each of these 12 links.

With the code below, I can print the URLs of all 12 links that exist on the main page:

import requests
from bs4 import BeautifulSoup

url = 'http://mlg.ucd.ie/modules/COMP41680/assignment2/index.html'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all("a")

all_urls = []
for link in links[1:]:  # skip the first anchor, which is not a month link
    link_address = 'http://mlg.ucd.ie/modules/COMP41680/assignment2/' + link.get("href")
    all_urls.append(link_address)
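As a side note, plain string concatenation works here because every href on the index page is relative. The standard library's `urllib.parse.urljoin` handles both relative and absolute hrefs, which is a slightly more robust way to build the full URLs:

```python
from urllib.parse import urljoin

# Resolve an href against the page it appeared on; relative hrefs
# are joined onto the base, absolute hrefs pass through unchanged.
base = 'http://mlg.ucd.ie/modules/COMP41680/assignment2/index.html'
full = urljoin(base, 'month-jan-001.html')
print(full)  # http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html
```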

Then I loop over all of them:

for i in range(0, 12):
    url = all_urls[i]
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')

The titles can be extracted with the lines below:

title_news = []
news_div = soup.find_all('div', class_='article')
for container in news_div:
    title = container.h5.a.text
    title_news.append(title)

The output of this code only includes the titles from the first page of each of these 12 links, while I need my code to go through multiple pages within each of the 12 URLs. The line below, if placed in an appropriate loop, gives me the link to the next page within each of these 12 links (it reads the pagination section and looks for the next-page URL):

page = soup.find('ul', {'class': 'pagination'}).select('li.page-link')[2].find('a')['href']

How should I use a page variable like this inside my code so that it extracts all the pages behind each of these 12 links and reads all the titles, not only the first-page titles?

You can use this code to get all titles from all the pages:

import requests
from bs4 import BeautifulSoup

base_url = "http://mlg.ucd.ie/modules/COMP41680/assignment2/"
soup = BeautifulSoup(
    requests.get(base_url + "index.html").content, "html.parser"
)

title_news = []
for a in soup.select("#all a"):
    next_link = a["href"]

    print("Getting", base_url + next_link)

    while True:
        soup = BeautifulSoup(
            requests.get(base_url + next_link).content, "html.parser"
        )
        # collect every article title on the current page
        for title in soup.select("h5 a"):
            title_news.append(title.text)

        # read the "Next" pagination link; on the last page its href is "#"
        next_link = soup.select_one('a[aria-label="Next"]')["href"]

        if next_link == "#":
            break

print("Length of title_news:", len(title_news))
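One caveat with the loop above: `select_one('a[aria-label="Next"]')` returns `None` when a page has no pagination block at all, and indexing `["href"]` on `None` raises a `TypeError`. A small hedge, assuming the same markup as this site, is to extract the next-page href through a helper that tolerates a missing link:

```python
from bs4 import BeautifulSoup

def next_page_href(soup):
    """Return the href of the 'Next' pagination link, or None when
    there is no next page (link missing or '#' placeholder)."""
    a = soup.select_one('a[aria-label="Next"]')
    if a is None:
        return None
    href = a.get("href")
    return href if href and href != "#" else None
```

The inner loop then becomes `while next_link is not None:` with `next_link = next_page_href(soup)` at the bottom, instead of checking for `"#"` explicitly.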

Prints:

Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-feb-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-mar-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-apr-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-may-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jun-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jul-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-aug-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-sep-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-oct-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-nov-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-dec-001.html
Length of title_news: 16226
