
Scrape dynamic content in Python

I'm new to the Python Scrapy module. I'm trying to scrape the restaurants' info on https://munchado.com/search?sst=a&fb=m&vt=s&svt=l&in=New%20York%2C%20NY%2C%20USA&at=c&lat=40.7127&lng=-74.0059&p=0&srb=r&srt=d&sq=american&sdt=ft&ovt=restaurant&d=0&st=d

While I have had some success scraping other webpages, this one is real trouble. It seems that the restaurants' info is loaded automatically when you submit a search request. By that I mean the info is not written in the webpage's source code, and could possibly come from the company's internal server or something. The elements also change over time. For example, if you search in the evening, some elements change their class from "div class='t-has-deals'" to "div class='t-closed-now'".

So my question is: is it still possible to scrape info from such webpages? If this counts as scraping dynamic content, is there a universal way to solve it? Thank you so much.

When dealing with dynamic sites, it is tougher to scrape the data than it is with static pages. First we have to identify how the data is rendered dynamically on the page. The data might be delivered in the following ways:

  1. From a JavaScript file which contains the data.
  2. From an AJAX response.
  3. From a WebSocket response. In this case we first have to send a relevant message to the server, which then gives us a response that might contain the data.
  4. From an API response.

There are more ways than the ones I mentioned. In your case the data is obtained from this api_request_url, and the image below shows the form_data which we need to provide when making the request to the api_request_url.

[Image: form_data]

This gives you the json_response shown below

[Image: json_response]

which contains the data you need. If you change the parameters in the form_data, you will get the corresponding data.
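As a rough sketch of the idea above, the request can be reproduced with the standard library alone. The endpoint URL and the form fields here are placeholders (the real values are whatever your browser's Network tab shows for the search request), and the helper names `build_api_request` and `fetch_json` are made up for illustration.

```python
import json
from urllib import parse, request

# Placeholder: substitute the real api_request_url seen in the Network tab.
API_REQUEST_URL = "https://example.com/api/search"

# Hypothetical form fields, loosely modelled on the query string in the question.
FORM_DATA = {"sq": "american", "p": 0}


def build_api_request(api_url, form_data):
    """Build a POST request carrying URL-encoded form data."""
    body = parse.urlencode(form_data).encode("utf-8")
    return request.Request(
        api_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_json(req, timeout=10):
    """Send the request and decode the JSON body of the response."""
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_api_request(API_REQUEST_URL, FORM_DATA)
# data = fetch_json(req)  # uncomment once the real URL and fields are filled in
```

Changing a value in `FORM_DATA` (for example the page number `p`, assuming the API pages that way) and rebuilding the request would then fetch a different slice of results.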

I'm not sure about Scrapy, so I can't help you there, but you could try Selenium. The code below should work with dynamically generated content.

import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions

driver = webdriver.Firefox()
url = "https://www.google.com"  # driver.get needs a full URL, including the scheme

try:
    driver.get(url)

    # If it takes a certain amount of time for the content to be created you can
    # use time.sleep
    time.sleep(5)

    # However, if you want to wait for specific content to appear, you
    # can use the following
    WebDriverWait(driver, 10).until(
        expected_conditions.presence_of_element_located(
            (By.ID, "id-of-your-element")
        )
    )

    # Then you can pull your HTML (do this before the browser is closed)
    html = driver.page_source
finally:
    # Always close the browser, even if the wait times out
    driver.quit()

Selenium has great docs too. Most of the code here can actually be found in the docs.
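Once you have the rendered HTML, the changing class names from the question can be handled by matching either one. Below is a minimal sketch using only the standard library. The two class names come from the question, while `StatusDivCounter`, `count_status_divs`, and the sample markup are invented for illustration. The real page may use other classes too.

```python
from html.parser import HTMLParser

# Class names quoted in the question; the site may use others as well.
STATUS_CLASSES = {"t-has-deals", "t-closed-now"}


class StatusDivCounter(HTMLParser):
    """Count <div> tags carrying any of the status classes."""

    def __init__(self):
        super().__init__()
        self.counts = {name: 0 for name in STATUS_CLASSES}

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        for name in STATUS_CLASSES.intersection(classes):
            self.counts[name] += 1


def count_status_divs(html):
    """Return a dict mapping each status class to the number of matching divs."""
    parser = StatusDivCounter()
    parser.feed(html)
    return parser.counts


# Made-up markup standing in for driver.page_source.
sample = "<div class='t-has-deals card'></div><div class='t-closed-now'></div>"
print(count_status_divs(sample))
```

In practice you would pass `driver.page_source` instead of `sample`, and likely collect the contents of each matching element instead of just counting them.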

