How to loop over links and scrape the content of news articles with BeautifulSoup
I'm new to Python, and I want to get the content and title of every news article on this page: https://www.nytimes.com/search?query=china+COVID-19
However, my current code stores the paragraphs of all 10 articles in a single list. How can I store each article's paragraphs in a dictionary for the article they belong to, and collect all the dictionaries in one list?
Any help would be greatly appreciated!
import requests
from bs4 import BeautifulSoup

# collect the link of every article on the search-results page
response = requests.get('https://www.nytimes.com/search?query=china+COVID-19')
response.encoding = 'utf-8'
soupe = BeautifulSoup(response.text, 'html.parser')

links = soupe.find_all('div', class_='css-1i8vfl5')
pagelinks = []
for link in links:
    url = link.contents[0].find_all('a')[0]
    pagelinks.append('https://www.nytimes.com' + url.get('href'))

# fetch each article and extract its paragraphs
articles = []
for i in pagelinks:
    response = requests.get(i)
    response.encoding = 'utf-8'
    soupe = BeautifulSoup(response.text, 'html.parser')
    for p in soupe.select('section.meteredContent.css-1r7ky0e div.css-53u6y8'):
        articles.append(p.text.strip())

print('\n'.join(articles))
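To keep each article's paragraphs together, append one dictionary per fetched page instead of adding every paragraph to a single flat list. A minimal sketch of that restructuring, using made-up inline HTML and URLs in place of the live `requests.get()` responses (the `css-53u6y8` selector mirrors the code above):

```python
from bs4 import BeautifulSoup

# Hypothetical sample pages keyed by URL, standing in for real responses.
pages = {
    'https://www.nytimes.com/a1': (
        '<section class="meteredContent">'
        '<div class="css-53u6y8">First para.</div>'
        '<div class="css-53u6y8">Second para.</div>'
        '</section>'
    ),
    'https://www.nytimes.com/a2': (
        '<section class="meteredContent">'
        '<div class="css-53u6y8">Other para.</div>'
        '</section>'
    ),
}

articles = []
for url, html in pages.items():
    soup = BeautifulSoup(html, 'html.parser')
    paragraphs = [p.get_text(strip=True)
                  for p in soup.select('section.meteredContent div.css-53u6y8')]
    # one dictionary per article: all of its paragraphs stay together
    articles.append({'url': url, 'paragraphs': paragraphs})

print(articles)
```

In the real loop, each `html` would come from `requests.get(i).text`, so the result is a list with one `{'url': ..., 'paragraphs': [...]}` dictionary per article.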
import urllib3
from bs4 import BeautifulSoup as bs

def scrape(url):
    http = urllib3.PoolManager()
    response = http.request("GET", url)
    soup_page = bs(response.data, 'lxml')  # requires the lxml package: pip install lxml

    articles = []
    containers = soup_page.findAll("div", attrs={'class': "css-1i8vfl5"})
    for container in containers:
        title = container.find('h4', {'class': 'css-2fgx4k'}).text.strip()
        description = container.find('p', {'class': 'css-16nhkrn'})
        article = {
            'title': title,
            # find() may return None, so guard before taking the text
            'description': description.text.strip() if description else None
        }
        articles.append(article)
    return articles

print(scrape("https://www.nytimes.com/search?query=china+COVID-19")[0])  # show the first article dictionary