
How to scrape data from a dynamic website containing JavaScript using Python?

I am trying to scrape data from https://www.doordash.com/food-delivery/chicago-il-restaurants/

The idea is to scrape all the data regarding the different restaurant listings on the website. The site is divided into different cities, but I only require restaurant data for Chicago.

All restaurant listings for the city have to be scraped, along with any other relevant data about the respective restaurants (e.g., reviews, rating, cuisine, address, state). I need to capture all the respective details (currently 4,326 listings) for the city in Excel.

I have tried to extract the restaurant name, cuisine, ratings and reviews inside the class named "StoreCard_root___1p3uN", but no data is displayed. The output is blank.


from selenium import webdriver

# Path to the local ChromeDriver binary.
chrome_path = r"D:\python project\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)

driver.get("https://www.doordash.com/food-delivery/chicago-il-restaurants/")

# Click through to the restaurant listings.
driver.find_element_by_xpath("""//*[@id="SeoApp"]/div/div[1]/div/div[2]/div/div[2]/div/div[2]/div[1]/div[3]""").click()

# Each restaurant card is wrapped in this class.
posts = driver.find_elements_by_class_name("StoreCard_root___1p3uN")

for post in posts:
    print(post.text)


You can use the API URL, since the data is actually rendered from it via an XHR request.

Iterate over the API link below and scrape whatever you want.

https://api.doordash.com/v2/seo_city_stores/?delivery_city_slug=chicago-il-restaurants&store_only=true&limit=50&offset=0

You just loop over the offset=0 parameter, increasing it by 50 each time (each page shows 50 items), until you reach 4300, which is the last page. Simply use range(0, 4350, 50). Before writing the full loop, see the single-page check sketched below.
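
It can help to fetch a single page first and inspect what the JSON contains; a minimal check, assuming the store_data key used in the code below:

import requests

# Fetch the first page (50 stores) to see the response structure.
url = ("https://api.doordash.com/v2/seo_city_stores/"
       "?delivery_city_slug=chicago-il-restaurants&store_only=true&limit=50&offset=0")
r = requests.get(url).json()

print(r.keys())                   # top-level keys, e.g. 'store_data'
print(r['store_data'][0].keys())  # fields available for each store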

import requests
import pandas as pd

data = []
# Page through the API 50 stores at a time (offsets 0, 50, 100, ..., 4300).
for offset in range(0, 4350, 50):
    print(f"Extracting offset# {offset}")
    r = requests.get(
        f"https://api.doordash.com/v2/seo_city_stores/?delivery_city_slug=chicago-il-restaurants&store_only=true&limit=50&offset={offset}").json()
    # Each response holds up to 50 stores under the 'store_data' key.
    for store in r['store_data']:
        data.append((store['name'], store['city'], store['category'],
                     store['num_ratings'], store['average_rating'], store['average_cost']))

df = pd.DataFrame(
    data, columns=['Name', 'City', 'Category', 'Num Ratings', 'Average Rating', 'Average Cost'])
df.to_csv('output.csv', index=False)
print("done")

Sample of output:


View output online: Click Here

Full data is here: Click Here

You can use a for loop, after importing the necessary libraries:

restaurant = []
location = []
ratings = []

# Each card holds one restaurant's details.
card_grids = driver.find_elements_by_class_name("StoreCard_root___1p3uN")
for i in card_grids:
    restaurant.append(i.find_element_by_class_name("StoreCard_storeTitle___1tIOi").text)
    location.append(i.find_element_by_class_name("StoreCard_storeInfo___3EpLG").text)
    # Some cards may have no reviews, so use find_elements (plural).
    s = i.find_elements_by_class_name("StoreCard_storeReviews___8EiRe")
    for e in s:
        ratings.append(e.text)
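
To get these lists into the Excel file the question asks for, they can be combined into a pandas DataFrame; a minimal sketch, padding the ratings column so all three lists line up (the output file name is just an example):

import pandas as pd

# Pad the ratings list in case some cards had no review element,
# so all three columns are the same length.
ratings += ['N/A'] * (len(restaurant) - len(ratings))

df = pd.DataFrame({'Restaurant': restaurant,
                   'Location': location,
                   'Rating': ratings})
# Writing .xlsx requires openpyxl; the file name is hypothetical.
df.to_excel('chicago_restaurants.xlsx', index=False)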

I have checked out the API that αԋɱҽԃ αмєяιcαη mentions. They also had an endpoint for restaurant info.

URL: https://api.doordash.com/v2/restaurant/[restaurantId]/

It was working until recently, when it started returning {"detail":"Request was throttled."}

Has anyone had the same issue / found a workaround?
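
One common workaround for a throttled endpoint is to slow down and retry with exponential backoff; a minimal sketch, assuming the throttle clears after waiting (the error shape is taken from the response above, and the restaurant id 1234 is purely a placeholder):

import time
import requests

def get_with_backoff(url, max_retries=5):
    # Retry with exponentially increasing delays while the API throttles us.
    delay = 5
    for attempt in range(max_retries):
        data = requests.get(url).json()
        if data.get('detail') != 'Request was throttled.':
            return data
        print(f"Throttled, sleeping {delay}s (attempt {attempt + 1})")
        time.sleep(delay)
        delay *= 2
    raise RuntimeError("Still throttled after retries")

# Hypothetical restaurant id, for illustration only.
info = get_with_backoff("https://api.doordash.com/v2/restaurant/1234/")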
