
How to loop over links and scrape the content of news articles with BeautifulSoup

I'm new to Python, and I want to get the content and title of all the news articles from this page: https://www.nytimes.com/search?query=china+COVID-19

However, my current code stores all the paragraphs of the 10 articles in a single list. How can I store each paragraph in a dictionary for the article it belongs to, and save all of those dictionaries in one list?

Any help would be greatly appreciated!

import requests
from bs4 import BeautifulSoup
import json

response = requests.get('https://www.nytimes.com/search?query=china+COVID-19')
response.encoding = 'utf-8'
soupe = BeautifulSoup(response.text, 'html.parser')

# each search result sits in a div with this class
links = soupe.find_all('div', class_='css-1i8vfl5')

pagelinks = []
for link in links:
    url = link.contents[0].find_all('a')[0]
    pagelinks.append('https://www.nytimes.com' + url.get('href'))

articles = []

for i in pagelinks:
    response = requests.get(i)
    response.encoding = 'utf-8'
    soupe = BeautifulSoup(response.text, 'html.parser')
    # every paragraph of every article ends up in the same flat list
    for p in soupe.select('section.meteredContent.css-1r7ky0e div.css-53u6y8'):
        articles.append(p.text.strip())
print('\n'.join(articles))
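
A minimal restructuring of the second loop above is sketched below: it keeps one dictionary per article and appends all of the dictionaries to a single list. It reuses the pagelinks list built in the first loop; the 'url' and 'paragraphs' keys are illustrative names, and the CSS selectors are the ones from the question, which may no longer match the live page.

import requests
from bs4 import BeautifulSoup

articles = []
for link in pagelinks:  # pagelinks comes from the first loop above
    response = requests.get(link)
    response.encoding = 'utf-8'
    soupe = BeautifulSoup(response.text, 'html.parser')
    # gather this article's paragraphs into their own list
    paragraphs = [p.text.strip()
                  for p in soupe.select('section.meteredContent.css-1r7ky0e div.css-53u6y8')]
    # one dictionary per article, all dictionaries in one list
    articles.append({'url': link, 'paragraphs': paragraphs})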
import urllib3
from bs4 import BeautifulSoup as bs

def scrape(url):
    http = urllib3.PoolManager()
    response = http.request("GET", url)
    soup_page = bs(response.data, 'lxml')  # requires the lxml package: pip install lxml
    articles = []

    # each search result is wrapped in a div with this class
    containers = soup_page.findAll("div", attrs={'class': "css-1i8vfl5"})

    for container in containers:
        title = container.find('h4', {'class': 'css-2fgx4k'}).text.strip()
        description = container.find('p', {'class': 'css-16nhkrn'})

        article = {
            'title': title,
            # some results have no description, so guard against None
            'description': description.text.strip() if description else None
        }

        articles.append(article)
    return articles

print(scrape("https://www.nytimes.com/search?query=china+COVID-19")[0])  # show the first article's dictionary
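
To also attach each article's body text to its dictionary, as the question asks, the function can follow the link inside each search-result container and collect that article's paragraphs. This is a sketch under the same assumptions: the class names come from the question and answer above and may have changed on nytimes.com, and the 'url' and 'content' keys are illustrative names.

import urllib3
from bs4 import BeautifulSoup as bs

def scrape_with_content(url):
    http = urllib3.PoolManager()
    response = http.request("GET", url)
    soup_page = bs(response.data, 'lxml')
    articles = []
    for container in soup_page.findAll("div", attrs={'class': "css-1i8vfl5"}):
        title = container.find('h4', {'class': 'css-2fgx4k'}).text.strip()
        # the search page uses relative links, so prepend the domain
        article_url = 'https://www.nytimes.com' + container.find('a').get('href')
        # fetch the article page and gather its paragraphs
        article_page = bs(http.request("GET", article_url).data, 'lxml')
        paragraphs = [p.text.strip()
                      for p in article_page.select('section.meteredContent.css-1r7ky0e div.css-53u6y8')]
        articles.append({'title': title, 'url': article_url, 'content': paragraphs})
    return articles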
