
How can I scrape all the information from a page that uses JavaScript to expand the content?

I am trying to scrape a page that has a list of elements and, at the bottom, an expand button that extends the list. It uses an onclick event to expand, and I don't know how to trigger it. I'm trying to use scrapy-splash, since I read it might work, but I can't make it function properly.

What I am currently trying to do is something like this:

    def expand_page(self, response):
        expand = response.css('#maisVagas')
        page = response.request.url
        if len(expand) > 0:
            expand = expand.xpath("@onclick").extract()
            yield SplashRequest(url=page, callback=self.expand_page, endpoint='execute',
                                args={'js_source': expand[0], "wait": 0.5})
        else:
            yield response.follow(page, self.open_page)

Even though it's in Portuguese, if it helps as a reference, the site I'm trying to scrape is https://www.vagas.com.br/vagas-em-rio-de-janeiro . The expand button is the blue button at the bottom of the page, and inspecting it shows this result:

<a data-grupo="todasVagas" data-filtro="pagina" data-total="16" data-url="/vagas-em-rio-de-janeiro?c%5B%5D=Rio+de+Janeiro&amp;pagina=2" class="btMaisVagas btn" id="maisVagas" onclick="ga('send', 'event', 'Pesquisa', 'anuncios');" href="#" style="pointer-events: all; cursor: pointer;">mostrar mais vagas</a>

It's not necessary to use Splash. If you look at the Network tab of Chrome DevTools, you can see that clicking the button fires a GET HTTP request with some parameters. Replicating that request directly is called re-engineering the HTTP request, and it's preferable to using Splash/Selenium, particularly if you're scraping a lot of data.

Clicking the button on the page shows this XHR:

Copying the request

When re-engineering a request, I copy the request as cURL (bash) and paste it into curl.trillworks.com. This gives me nicely formatted headers, parameters and cookies for that particular request. I usually play about with the HTTP request using the requests Python package. In this case, the simplest HTTP request is one where you only have to pass the parameters, not the headers.

Here are the parameters; note the page number:

If you look on the right-hand side, you have headers and parameters. Using the requests package, I figured out that you only need to pass the page parameters to get the information you need.

params = (
    ('c[]', 'Rio de Janeiro'),
    ('pagina', '2'),
    ('_', '1596444852311'),
)

You can change the page number to get the next 40 items' worth of content. You also know there are 590 items in total on this page.

This is for the second page.
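To experiment with this outside Scrapy first, here is a minimal sketch using the requests package and the parameters above (I'm assuming the trailing `_` value is just a cache-busting timestamp and is likely optional):

import requests

url = 'https://www.vagas.com.br/vagas-em-rio-de-janeiro'

# Parameters copied from the XHR; 'pagina' picks which batch of 40 listings.
params = (
    ('c[]', 'Rio de Janeiro'),
    ('pagina', '2'),
    ('_', '1596444852311'),  # assumed cache-busting timestamp, likely optional
)

response = requests.get(url, params=params)
print(response.status_code)
print(response.text[:500])  # HTML fragment with the next batch of listings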

So, as a minimal example of this in Scrapy:

Code Example

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['vagas.com.br']

    data = {
        'c[]': 'Rio de Janeiro',
        'pagina': '2',
        '_': '1596444852311',
    }

    def start_requests(self):
        url = 'https://www.vagas.com.br/vagas-em-rio-de-janeiro'
        # A GET FormRequest encodes formdata into the URL's query string.
        yield scrapy.FormRequest(url=url, method='GET', formdata=self.data,
                                 callback=self.parse)

    def parse(self, response):
        # Listings alternate between 'vaga even' and 'vaga odd' classes,
        # so match on the common 'vaga' class rather than an exact string.
        cards = response.xpath('//li[contains(@class, "vaga")]')
        print(cards)

Explanation

Using start_requests to build the first request, we pass our parameters as formdata; since the method is GET, Scrapy encodes them into the URL's query string. This grabs the HTML for the next 40 items, the same content the page loads when you click the button.
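If you wanted to walk through every page rather than just the second one, here is a hedged sketch building on the spider above (assuming 40 items per page and 590 items in total, i.e. roughly 15 pages, per the figures earlier; the spider name and the yielded item field are illustrative, not from the original code):

import scrapy


class VagasPaginatedSpider(scrapy.Spider):
    # Hypothetical extension of the spider above that walks 'pagina' upwards.
    name = 'vagas_paginated'
    allowed_domains = ['vagas.com.br']
    url = 'https://www.vagas.com.br/vagas-em-rio-de-janeiro'
    max_page = 15  # ~590 items / 40 items per page, per the figures above

    def start_requests(self):
        yield self.page_request(1)

    def page_request(self, page):
        return scrapy.FormRequest(
            url=self.url,
            method='GET',
            formdata={'c[]': 'Rio de Janeiro', 'pagina': str(page)},
            callback=self.parse,
            cb_kwargs={'page': page},
        )

    def parse(self, response, page):
        for card in response.xpath('//li[contains(@class, "vaga")]'):
            # Illustrative field; adjust the inner selector to the real markup.
            yield {'title': card.xpath('.//a/text()').get()}
        if page < self.max_page:
            yield self.page_request(page + 1)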
