
How to scrape websites that have loaders?

I'm trying to scrape a website that shows a loading screen. When I browse the website it displays "loading..." for a second and then the content appears. The problem is that when I try to scrape it with Scrapy, I get nothing back (probably because of that loading step). Can I solve this with Scrapy, or should I use some other tool? Here's the link to the website if you want to take a look: https://www.graana.com/project/601/lotus-lake-towers

Since the page sends a GET request to fetch the property information, you should mimic that same request in your code. (You can observe the GET call in the browser's developer console under Network -> XHR.)

    # -*- coding: utf-8 -*-
    import scrapy


    class GranaSpider(scrapy.Spider):
        name = 'grana'
        allowed_domains = ['www.graana.com']
        start_urls = ['https://www.graana.com/api/area/slug/601']

        def parse(self, response):
            # Scrapy already issues the GET request for each start URL, so the
            # JSON payload is available directly on `response` here.
            print(response.body)
            # convert the json response and save it to your storage system

The output is in JSON format; convert it to whatever format suits your storage.
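
For example, a minimal sketch of that conversion inside the spider's parse method, assuming the endpoint returns a single JSON object (inspect the actual payload before deciding which fields to keep):

    import json

    def parse(self, response):
        # response.body holds the raw JSON returned by the API endpoint
        data = json.loads(response.body)
        # `data` is now an ordinary Python dict/list; inspect it and yield
        # only the fields you actually need (the key below is illustrative).
        yield {'property': data}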


I know this question is old and already answered, but I wanted to share my solution after encountering a similar problem. The accepted answer was not helpful to me because I was not using Scrapy.

Problem

Scrape websites that first display a loading page and then display the actual content.

Here's an example of such a website: [GIF showing the site's loading-page animation]

The requests library will not work for such websites. In my experience, requests.get(URL, headers=HEADERS) simply times out.
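
For reference, this is roughly the plain-requests call that fails here (the headers are placeholders); requests never executes the JavaScript that replaces the loading page, so it never sees the real content:

    import requests

    URL = "https://www.myjob.mu/ShowResults.aspx?Keywords=&Location=&Category=39&Recruiter=Company&SortBy=MostRecent"
    HEADERS = {"User-Agent": "Mozilla/5.0"}  # placeholder headers

    try:
        # Even if this returns, it only contains the loading shell; for me it
        # simply timed out, as described above.
        response = requests.get(URL, headers=HEADERS, timeout=10)
        print(response.text[:500])
    except requests.exceptions.Timeout:
        print("Request timed out before any content was returned.")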

Solution

Use Selenium.

  • First, you need to know approximately how long the loading animation lasts. On the website above, it takes around 3 seconds.
  • The trick is to simply sleep your program for the duration of the animation after navigating to the website with driver.get(URL).
  • By the time the program finishes sleeping, the loading animation will be over, so we can safely extract the HTML of the actual page content using driver.page_source, as in the code below.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    import time

    # the following options are only for setup purposes
    chrome_options = Options()
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')

    driver = webdriver.Chrome(options=chrome_options)

    URL = "https://www.myjob.mu/ShowResults.aspx?Keywords=&Location=&Category=39&Recruiter=Company&SortBy=MostRecent"

    driver.get(URL)
    time.sleep(5)  # any number > 3 should work fine
    html = driver.page_source
    print(html)

The BeautifulSoup library can then be used to parse the HTML.
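
A minimal sketch of that parsing step, taking the html string from the Selenium snippet above; it just dumps the page title and all links, since the real selectors depend on the site's markup:

    from bs4 import BeautifulSoup

    # `html` is the page source grabbed by driver.page_source above
    soup = BeautifulSoup(html, "html.parser")

    print(soup.title.string)
    # Dump every link as a starting point; swap in the real tag/class selectors
    # for the job listings once you have inspected the rendered HTML.
    for link in soup.find_all("a", href=True):
        print(link["href"], link.get_text(strip=True))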
