
How to scrape websites that have loaders?

I'm trying to scrape a website that shows a loading screen. When I browse the website it displays "loading..." for a second and then the content appears. The problem is that when I try to scrape it with Scrapy, I get nothing back (probably because of that loading step). Can I solve this with Scrapy, or should I use some other tool? Here's the link to the website if you want to take a look: https://www.graana.com/project/601/lotus-lake-towers

Since the page sends a GET request to fetch the property information, you should mimic that same request in your code. (You can observe the GET call in the browser's developer console under Network -> XHR.)

    # -*- coding: utf-8 -*-
    import scrapy


    class GranaSpider(scrapy.Spider):
        name = 'grana'
        allowed_domains = ['www.graana.com']
        start_urls = ['https://www.graana.com/api/area/slug/601']

        def parse(self, response):
            # Scrapy already issues the GET request for each start URL, so the
            # JSON payload is available directly on `response` here.
            print(response.body)
            # convert the json response and save it to your storage system

The output is in JSON format; convert it to whatever format suits your storage.
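
For example, a minimal sketch of that conversion inside the spider's parse method, assuming the endpoint returns a single JSON object (inspect the actual payload before deciding which fields to keep):

    import json

    def parse(self, response):
        # response.body holds the raw JSON returned by the API endpoint
        data = json.loads(response.body)
        # `data` is now an ordinary Python dict/list; inspect it and yield
        # only the fields you actually need (the key below is illustrative).
        yield {'property': data}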


I know this question is old and already answered, but I wanted to share my solution after encountering a similar problem. The accepted answer was not helpful to me because I was not using Scrapy.

Problem

Scrape websites that first display a loading page and then display the actual content.

Here's an example of such a website: [GIF showing the site's loading-page animation]

The requests library will not work for such websites. In my experience, requests.get(URL, headers=HEADERS) simply times out.
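
For reference, this is roughly the plain-requests call that fails here (the headers are placeholders); requests never executes the JavaScript that replaces the loading page, so it never sees the real content:

    import requests

    URL = "https://www.myjob.mu/ShowResults.aspx?Keywords=&Location=&Category=39&Recruiter=Company&SortBy=MostRecent"
    HEADERS = {"User-Agent": "Mozilla/5.0"}  # placeholder headers

    try:
        # Even if this returns, it only contains the loading shell; for me it
        # simply timed out, as described above.
        response = requests.get(URL, headers=HEADERS, timeout=10)
        print(response.text[:500])
    except requests.exceptions.Timeout:
        print("Request timed out before any content was returned.")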

Solution

Use Selenium.

  • First, you need to know approximately how long the loading animation lasts. On the website above, it takes around 3 seconds.
  • The trick is to simply sleep your program for the duration of the animation after navigating to the website with driver.get(URL).
  • By the time the program finishes sleeping, the loading animation will be over, so we can safely extract the HTML of the actual page content using driver.page_source, as in the code below.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    import time

    # the following options are only for setup purposes
    chrome_options = Options()
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')

    driver = webdriver.Chrome(options=chrome_options)

    URL = "https://www.myjob.mu/ShowResults.aspx?Keywords=&Location=&Category=39&Recruiter=Company&SortBy=MostRecent"

    driver.get(URL)
    time.sleep(5)  # any number > 3 should work fine
    html = driver.page_source
    print(html)

The BeautifulSoup library can then be used to parse the HTML.
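
A minimal sketch of that parsing step, taking the html string from the Selenium snippet above; it just dumps the page title and all links, since the real selectors depend on the site's markup:

    from bs4 import BeautifulSoup

    # `html` is the page source grabbed by driver.page_source above
    soup = BeautifulSoup(html, "html.parser")

    print(soup.title.string)
    # Dump every link as a starting point; swap in the real tag/class selectors
    # for the job listings once you have inspected the rendered HTML.
    for link in soup.find_all("a", href=True):
        print(link["href"], link.get_text(strip=True))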
