How to scrape data from a dynamic website that renders content with JavaScript, using Python?
I am trying to scrape data from https://www.doordash.com/food-delivery/chicago-il-restaurants/
The idea is to scrape all the data for the different restaurant listings on the website. The site is divided into different cities, but I only need restaurant data for Chicago.
All restaurant listings for the city have to be scraped, along with any other relevant data about the respective restaurants (e.g. reviews, rating, cuisine, address, state). I need to capture all of those details (currently 4,326 listings) for the city in Excel.
I have tried to extract the restaurant name, cuisine, rating, and reviews inside the class named "StoreCard_root___1p3uN", but no data is displayed. The output is blank.
```python
from selenium import webdriver

chrome_path = r"D:\python project\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("https://www.doordash.com/food-delivery/chicago-il-restaurants/")
driver.find_element_by_xpath("""//*[@id="SeoApp"]/div/div[1]/div/div[2]/div/div[2]/div/div[2]/div[1]/div[3]""").click()
posts = driver.find_elements_by_class_name("StoreCard_root___1p3uN")
for post in posts:
    print(post.text)
```
You can use the API URL, since the data is actually rendered from it via an XHR request. Iterate over the API link below and scrape whatever you want:
https://api.doordash.com/v2/seo_city_stores/?delivery_city_slug=chicago-il-restaurants&store_only=true&limit=50&offset=0
You just loop over the offset=0 parameter, increasing it by 50 each time (each page shows 50 items), until you reach 4300, which is the last page. Simply use range(0, 4350, 50).
```python
import requests
import pandas as pd

data = []
for offset in range(0, 4350, 50):
    print(f"Extracting offset# {offset}")
    r = requests.get(
        f"https://api.doordash.com/v2/seo_city_stores/?delivery_city_slug=chicago-il-restaurants&store_only=true&limit=50&offset={offset}").json()
    # use a separate name for each store so it doesn't shadow the loop variable
    for store in r['store_data']:
        data.append((store['name'], store['city'], store['category'],
                     store['num_ratings'], store['average_rating'], store['average_cost']))

df = pd.DataFrame(
    data, columns=['Name', 'City', 'Category', 'Num Ratings', 'Average Rating', 'Average Cost'])
df.to_csv('output.csv', index=False)
print("done")
```
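Hard-coding `range(0, 4350, 50)` stops working the moment the listing count changes. A minimal sketch of an alternative, assuming (unverified) that the endpoint returns an empty `store_data` list once the offset runs past the last listing: keep requesting pages until one comes back empty.

```python
API = ("https://api.doordash.com/v2/seo_city_stores/"
       "?delivery_city_slug=chicago-il-restaurants"
       "&store_only=true&limit=50&offset={offset}")

def paginate(fetch_page, page_size=50):
    """Collect items page by page until a page comes back empty."""
    items, offset = [], 0
    while True:
        page = fetch_page(offset)
        if not page:  # assumed to mean we are past the last listing
            break
        items.extend(page)
        offset += page_size
    return items

def fetch_doordash_page(offset):
    import requests  # local import so the pagination helper stays stdlib-only
    r = requests.get(API.format(offset=offset), timeout=30)
    r.raise_for_status()
    return r.json().get("store_data", [])

# stores = paginate(fetch_doordash_page)
```

Because `paginate` takes the fetch function as a parameter, the stopping logic can be tested without hitting the network.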
After importing the necessary libraries, you can use a for loop:
```python
restaurant = []
location = []
ratings = []

card_grids = driver.find_elements_by_class_name("StoreCard_root___1p3uN")
for i in card_grids:
    restaurant.append(i.find_element_by_class_name("StoreCard_storeTitle___1tIOi").text)
    location.append(i.find_element_by_class_name("StoreCard_storeInfo___3EpLG").text)
    s = i.find_elements_by_class_name("StoreCard_storeReviews___8EiRe")
    for e in s:
        ratings.append(e.text)
```
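Since the asker wants the results in Excel, the three parallel lists above still need to be written to a file. A sketch using only the standard library (a CSV opens directly in Excel; the lists are assumed to be the same length, which holds only if every card has exactly one review element):

```python
import csv

def write_csv(path, restaurant, location, ratings):
    """Write the three parallel lists as rows of a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Restaurant", "Location", "Rating"])
        writer.writerows(zip(restaurant, location, ratings))

# write_csv("chicago.csv", restaurant, location, ratings)
```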
I have checked out the API that αԋɱҽԃ αмєяιcαη mentions. They also had an endpoint for restaurant info:
URL: https://api.doordash.com/v2/restaurant/[restaurantId]/
It was working until recently, when it started returning {"detail":"Request was throttled."}
Has anyone had the same issue / found a workaround?
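One common workaround for throttled endpoints is to retry with exponential backoff (and to slow the overall request rate). A sketch with the fetch and throttle-check functions injected, so the backoff logic itself is generic; whether DoorDash actually lifts the throttle after a delay is an assumption:

```python
import time

def with_backoff(fetch, is_throttled, retries=5, base_delay=1.0):
    """Call fetch(); if the response looks throttled, wait and retry.

    The delay doubles on each attempt: base_delay, 2x, 4x, ...
    Raises RuntimeError if every attempt was throttled.
    """
    delay = base_delay
    for _ in range(retries):
        resp = fetch()
        if not is_throttled(resp):
            return resp
        time.sleep(delay)
        delay *= 2
    raise RuntimeError("still throttled after retries")
```

For this endpoint, the hypothetical wiring would be something like `fetch=lambda: requests.get(url).json()` and `is_throttled=lambda r: r.get("detail") == "Request was throttled."`.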