Scraping using BS4 with time
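I'm fairly new to Python and BS4, and I want to scrape news from a particular website.
My goal is to get the news linked from the parent URL that was published today, but when I try, the script produces an empty CSV file. Please suggest how I can fix or improve it! Thanks in advance.
Here is my code: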
from bs4 import BeautifulSoup
import requests, re, pprint
from datetime import date
import csv

today = date.today()
d2 = today.strftime("%B %d, %Y")
result = requests.get('https://www.spglobal.com/marketintelligence/en/news-insights/latest-news-headlines/')
soup = BeautifulSoup(result.content, "lxml")
urls =[]
titles = []
contents = []

#collect all links from 'latest news' into a list
for item in soup.find_all("a"):
    url = item.get("href")
    market_intelligence_pattern = re.compile("^/marketintelligence/en/news-insights/latest-news-headlines/.*")
    if re.findall(market_intelligence_pattern, url):
        if re.findall(market_intelligence_pattern, url)[0] == "/marketintelligence/en/news-insights/latest-news-headlines/index":
            continue
        else:
            news = "https://www.spglobal.com/"+re.findall(market_intelligence_pattern, url)[0]
            urls.append(news)
    else:
        continue

newfile = open('output.csv','w',newline='')
outputWriter = csv.writer(newfile)

#extract today's articles = format: date,title,content
for each in urls:
    individual = requests.get(each)
    soup2 = BeautifulSoup(individual.content, "lxml")
    date = soup2.find("ul",class_="meta-data").text.strip() #getting the date
    #print(date)
    if d2 != date: #today's articles only
        continue
    else:
        title = soup2.find("h2", class_="article__title").text.strip() #getting the title
        titles.append(title)
        #print(title)
        precontent = soup2.find("div", class_="wysiwyg-content") #getting content
        content = precontent.findAll("p")
        indi_content = []
        for i in content:
            indi_content.append(i.text)
        #contents.append(content)
        outputWriter.writerow([date,title,indi_content])
Maybe this will push you in the right direction:
from datetime import date

import requests
from bs4 import BeautifulSoup

result = requests.get('https://www.spglobal.com/marketintelligence/en/news-insights/latest-news-headlines/')
# href=True skips anchors that have no href attribute (item['href'] would raise KeyError)
links = BeautifulSoup(result.content, "lxml").find_all("a", href=True)
# Date format the article pages are compared against, e.g. "15 Oct, 2020"
today = date.today().strftime("%d %b, %Y")

for item in links:
    if item['href'].startswith("/marketintelligence/en/news-insights/latest") and not item['href'].endswith("index"):
        article_soup = BeautifulSoup(requests.get(f"https://spglobal.com{item['href']}").content, "lxml")
        article_date = article_soup.find("li", {"class": "meta-data__date"})
        # Skip pages without a date element instead of failing on None
        if article_date is not None and article_date.getText(strip=True) == today:
            print(article_soup.find("h2", {"class": "article__title"}).getText(strip=True))
This prints the article title whenever the article's date matches today's date.
Output:
Houston, America's fossil fuel capital, braces for the energy transition
Blackstone to sell BioMed for $14.6B; Simon JV deal talks for J.C. Penney stall
Next mega-turbine is coming but 'the sky has a limit,' says MHI Vestas CEO
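One likely reason the question's script writes nothing is that its date string never equals the text it is compared with: it formats today as "%B %d, %Y" and compares that against the text of the whole meta-data list, whereas the answer compares "%d %b, %Y" against the meta-data__date item only. The following is a minimal offline sketch of that comparison plus the CSV step, assuming the date format used in the answer is the one the site shows. The names site_date, todays_rows, and write_rows, as well as the sample data, are hypothetical and exist only for illustration.

```python
import csv
import io
from datetime import date


def site_date(d):
    """Format a date the way the answer compares it, e.g. '15 Oct, 2020'."""
    return d.strftime("%d %b, %Y")


def todays_rows(articles, today):
    """Keep only articles whose scraped date text equals today's formatted date.

    `articles` is an iterable of (date_text, title, paragraphs) tuples.
    """
    wanted = site_date(today)
    return [
        (date_text, title, " ".join(paragraphs))
        for date_text, title, paragraphs in articles
        if date_text.strip() == wanted
    ]


def write_rows(rows, fileobj):
    """Write a header plus one row per article to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(["date", "title", "content"])
    writer.writerows(rows)


# Hypothetical scraped data standing in for real article pages.
sample = [
    ("15 Oct, 2020", "Title A", ["First paragraph.", "Second paragraph."]),
    ("14 Oct, 2020", "Title B", ["Older article."]),
]
today = date(2020, 10, 15)  # fixed date so the example is reproducible

# The question's format and the answer's format produce different strings.
print(today.strftime("%B %d, %Y") == site_date(today))  # False

buffer = io.StringIO()  # in a script: open("output.csv", "w", newline="")
write_rows(todays_rows(sample, today), buffer)
print(buffer.getvalue())
```

In a real run the tuples would come from each article page, and opening the output file in a with block ensures it is flushed and closed when the loop finishes. Joining the paragraphs into one string keeps the content column readable instead of storing a Python list representation.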