Web scraping news page with a "load more"
I'm trying to scrape the news site https://inshorts.com/en/read/national, but I only get the articles shown on the first screen. I need every article on the site that contains a given word (e.g. "COVID-19"), without having to click the "Load More" button manually.
Here is the code that fetches the currently visible articles:
import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ["https://inshorts.com/en/read/national"]
news_data_content, news_data_title, news_data_category, news_data_time = [], [], [], []

for url in urls:
    category = url.split('/')[-1]
    data = requests.get(url)
    soup = BeautifulSoup(data.content, 'html.parser')
    news_title = []
    news_content = []
    news_category = []
    news_time = []
    for headline, article, time in zip(
            soup.find_all('div', class_="news-card-title news-right-box"),
            soup.find_all('div', class_="news-card-content news-right-box"),
            soup.find_all('div', class_="news-card-author-time news-card-author-time-in-title")):
        news_title.append(headline.find('span', attrs={'itemprop': "headline"}).string)
        news_content.append(article.find('div', attrs={'itemprop': "articleBody"}).string)
        # fixed typo: `clas=` -> `class_=`, and append the tag's text instead of the tag itself
        news_time.append(time.find('span', class_="date").text)
        news_category.append(category)
    news_data_title.extend(news_title)
    news_data_content.extend(news_content)
    news_data_category.extend(news_category)
    news_data_time.extend(news_time)

df1 = pd.DataFrame(news_data_title, columns=["Title"])
df2 = pd.DataFrame(news_data_content, columns=["Content"])
df3 = pd.DataFrame(news_data_category, columns=["Category"])
df4 = pd.DataFrame(news_data_time, columns=["time"])
df = pd.concat([df1, df2, df3, df4], axis=1)

def name():
    a = input("File Name: ")
    return a

b = name()
df.to_csv(b + ".csv")
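Since you only need articles that mention a given word (e.g. "COVID-19"), one option is to filter the assembled DataFrame after scraping. A minimal sketch; the sample frame below is a hypothetical stand-in for the scraped `df` above:

```python
import pandas as pd

# hypothetical sample frame standing in for the scraped `df`
df = pd.DataFrame({
    "Title": ["Vaccine drive expands", "Cricket match ends"],
    "Content": ["India's COVID-19 vaccination drive ...", "The match ended in a draw ..."],
})

keyword = "COVID-19"
# case=False makes the match case-insensitive; na=False treats missing cells as no match
mask = (df["Content"].str.contains(keyword, case=False, na=False)
        | df["Title"].str.contains(keyword, case=False, na=False))
matches = df[mask]
print(matches["Title"].tolist())  # → ['Vaccine drive expands']
```

The same `mask` filter can be applied to your real `df` just before `df.to_csv(...)`.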
You can use this example of how to simulate clicking the Load More button:
import re
import requests
from bs4 import BeautifulSoup

url = "https://inshorts.com/en/read/national"
api_url = "https://inshorts.com/en/ajax/more_news"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"
}

# load first page:
html_doc = requests.get(url, headers=headers).text
min_news_id = re.search(r'min_news_id = "([^"]+)"', html_doc).group(1)

pages = 10  # <-- here I limit number of pages to 10
while pages:
    soup = BeautifulSoup(html_doc, "html.parser")

    # search the soup for your articles here
    # ...

    # here I just print the headlines:
    for headline in soup.select('[itemprop="headline"]'):
        print(headline.text)

    # load next batch of articles:
    data = requests.post(api_url, data={"news_offset": min_news_id}).json()
    html_doc = data["html"]
    min_news_id = data["min_news_id"]

    pages -= 1
This prints the headlines from the first 10 pages:
...
Moeen has done some wonderful things in Test cricket: Root
There should be an evolution in player-media relationship: Federer
Swiggy in talks to raise over $500 mn at $10 bn valuation: Reports
Tesla investors urged to reject Murdoch, Kimbal Musk's re-election
Doctor dies on Pune-Mumbai Expressway when rolls of paper fall on his car
2 mothers name newborn girls after Cyclone Gulab in Odisha
100 US citizens, permanent residents waiting to leave Afghanistan
Iran's nuclear programme has crossed all red lines: Israeli PM
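To keep only the articles that contain your keyword, you can filter while parsing each batch instead of printing every headline. A minimal, self-contained sketch; the HTML fragment below is an assumption modelled on the `itemprop` attributes the loop above already relies on, not the exact markup the AJAX endpoint returns:

```python
from bs4 import BeautifulSoup

# hypothetical fragment standing in for one batch of `data["html"]`
html_doc = '''
<div class="news-card">
  <span itemprop="headline">Sample headline about COVID-19</span>
  <div itemprop="articleBody">Body text mentioning COVID-19 cases.</div>
</div>
<div class="news-card">
  <span itemprop="headline">Unrelated sports story</span>
  <div itemprop="articleBody">A match report.</div>
</div>
'''

keyword = "covid-19"
soup = BeautifulSoup(html_doc, "html.parser")
hits = []
for card in soup.select("div.news-card"):
    headline = card.select_one('[itemprop="headline"]').get_text(strip=True)
    body = card.select_one('[itemprop="articleBody"]').get_text(strip=True)
    # case-insensitive keyword match on headline or body
    if keyword in headline.lower() or keyword in body.lower():
        hits.append(headline)
print(hits)  # → ['Sample headline about COVID-19']
```

Running this check inside the `while pages:` loop on each `html_doc` would collect only the matching articles across pages.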
@AndrejKesely Could you help me with this page https://www.microplay.cl/productos/juguetes?categorias=pop ? I need to print the names of all the Funkos shown on screen (it's a page with a "Load More" button). Thanks in advance.