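Problems getting article content while scraping a news website using Beautiful Soup
I am trying to scrape news articles from an RSS feed, along with details such as the title, description, URL, and date. The description column does not contain the whole article content as I expected. My code is below.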
import requests
from bs4 import BeautifulSoup as bs
url='https://www.business-standard.com/rss/economy-policy-102.rss'
resp= requests.get(url)
soup = bs(resp.content,features='xml')
items= soup.findAll('item')
news_items = []
for item in items:
    news_item = {}
    news_item['title'] = item.title.text
    news_item['description'] = item.description.text
    news_item['link'] = item.link.text
    news_item['pubDate'] = item.pubDate.text
    news_items.append(news_item)
import pandas as pd
df = pd.DataFrame(news_items,columns=['title','description','link','pubDate'])
df['description'][0]
Output obtained - 'The re-import in the extended period would be without payment of basic customs duty and integrated goods and services tax'
As shown above, I am not getting the complete article content. What should I change?
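The RSS feed does not contain the full text of the articles; its description field holds only a short summary. You have to open each link and extract the article from the page itself.
For example: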
import requests
from bs4 import BeautifulSoup
url='https://www.business-standard.com/rss/economy-policy-102.rss'
soup = BeautifulSoup(requests.get(url).content, 'xml')
news_items = []
for item in soup.findAll('item'):
    news_item = {}
    news_item['title'] = item.title.text
    # The feed's description is only a short excerpt of the article
    news_item['excerpt'] = item.description.text
    print(item.link.text)
    # Download the article page and take the text of its .p-content element
    s = BeautifulSoup(requests.get(item.link.text).content, 'html.parser')
    news_item['text'] = s.select_one('.p-content').get_text(strip=True, separator=' ')
    news_item['link'] = item.link.text
    news_item['pubDate'] = item.pubDate.text
    news_items.append(news_item)
import pandas as pd
df = pd.DataFrame(news_items)
df.to_csv('data.csv')
Running this creates data.csv, which can be opened in a spreadsheet program such as LibreOffice.
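One caveat with the loop above: `select_one('.p-content')` returns `None` when a page has no matching element, and calling `.get_text()` on that raises an `AttributeError`. A slow or failed request can also stall or abort the whole run. The following is a minimal sketch of a more defensive variant, not a definitive implementation. The helper names `extract_article_text` and `fetch_article_text`, the 10-second timeout, and the sample HTML are assumptions made for illustration; only the `.p-content` selector comes from the answer.

```python
import requests
from bs4 import BeautifulSoup


def extract_article_text(html, selector=".p-content"):
    """Return the text of the first element matching selector, or None."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)
    if node is None:
        # The page layout differs from what the selector expects.
        return None
    return node.get_text(strip=True, separator=" ")


def fetch_article_text(url, selector=".p-content", timeout=10):
    """Download url and extract the article text; return None on failure."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()  # treat HTTP 4xx/5xx responses as failures
    except requests.RequestException:
        return None
    return extract_article_text(resp.content, selector)


# Offline check with a made-up page, so no network access is needed.
sample_html = '<div class="p-content"><p>First paragraph.</p><p>Second one.</p></div>'
print(extract_article_text(sample_html))                    # First paragraph. Second one.
print(extract_article_text("<div>no article here</div>"))   # None
```

Inside the loop, `news_item['text'] = fetch_article_text(item.link.text)` would then store `None` for articles that could not be fetched or parsed instead of stopping the script.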