
Scrape infinite scrolling websites with scrapy

I want to crawl earnings call transcripts from https://www.seekingalpha.com with Scrapy.

The spider should behave as follows:

1) At the start, a list of company codes ccodes is provided.
2) For each company, all available transcript URLs are parsed from https://www.seekingalpha.com/symbol/A/earnings/transcripts.
3) The content of each transcript URL is parsed.

The difficulty is that https://www.seekingalpha.com/symbol/A/earnings/transcripts uses an infinite-scrolling mechanism. The idea is therefore to iterate through the JSON responses at https://www.seekingalpha.com/symbol/A/earnings/more_transcripts?page=1 with page = 1, 2, 3, ... which are normally requested by JavaScript. Each JSON response contains the keys html and count. The key html should be used to parse transcript URLs, and the key count should be used to stop when there are no further URLs; the stopping criterion is count = 0.
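Outside of Scrapy, the pagination contract can be sketched as follows. This is only a minimal sketch using the requests library, assuming the endpoint really returns JSON with html and count keys as described (in practice the site may additionally require login cookies or headers); fetch_transcript_pages is a hypothetical helper name:

import requests

def fetch_transcript_pages(ccode):
    # Walk page = 1, 2, 3, ... until the endpoint reports count = 0.
    page = 1
    while True:
        url = f"https://seekingalpha.com/symbol/{ccode}/earnings/more_transcripts?page={page}"
        data = requests.get(url).json()
        if data.get("count", 0) == 0:
            break  # no further transcript urls available
        yield data["html"]  # HTML fragment containing the transcript links
        page += 1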

Here is my code so far. I have already managed to successfully parse the first JSON page for each company code, but I have no idea how to iterate through the remaining JSON pages and stop when there are no more URLs.

import scrapy
import json
from scrapy.http import FormRequest
from scrapy.selector import Selector

class QuotesSpider(scrapy.Spider):

    name = "quotes"
    start_urls = ["https://seekingalpha.com/account/login"]
    custom_settings = { 'DOWNLOAD_DELAY': 2 }

    loginData = {
        'slugs[]': "",
        'rt': "",
        'user[url_source]': 'https://seekingalpha.com/account/login',
        'user[location_source]': 'orthodox_login',
        'user[email]': 'abc',
        'user[password]': 'xyz'
    }

    def parse(self, response):
        # Fill in and submit the login form embedded in the login page.
        return FormRequest.from_response(
            response=response,
            formdata=self.loginData,
            formid='orthodox_login',
            callback=self.verify_login,
        )

    def verify_login(self, response):
        # TODO: check that the login actually succeeded before crawling.
        return self.make_initial_requests()

    def make_initial_requests(self):
        ccodes = ["A", "AB", "GOOGL"]
        for ccode in ccodes:
            # Request the first JSON page of transcript links for each company.
            yield scrapy.Request(
                url="https://seekingalpha.com/symbol/" + ccode + "/earnings/more_transcripts?page=1",
                callback=self.parse_link_page,
                meta={"ccode": ccode, "page": 1},
            )

    def parse_link_page(self, response):
        ccode = response.meta.get("ccode")
        page = response.meta.get("page")
        data = json.loads(response.text)
        # The transcript links sit inside the rendered fragment under the "html" key.
        condition = "//a[contains(text(),'Results - Earnings Call Transcript')]/@href"
        transcript_urls = Selector(text=data["html"]).xpath(condition).getall()
        for transcript_url in transcript_urls:
            yield scrapy.Request(
                url="https://seekingalpha.com" + transcript_url,
                callback=self.save_contents,
                meta={"ccode": ccode},
            )

    def save_contents(self, response):
        # TODO: parse and store the transcript content.
        pass

You should be able to execute the code without authentication. The expected result is that all URLs from https://www.seekingalpha.com/symbol/A/earnings/transcripts are crawled. It is therefore necessary to request https://www.seekingalpha.com/symbol/A/earnings/more_transcripts?page=<page> with page = 1, 2, 3, ... until all available URLs are parsed.

Adding the below after the loop over transcript_urls in parse_link_page seems to work. It yields a new request with a callback back to parse_link_page whenever transcript_urls were found on the current page. Note that this requires from urllib.parse import urlparse, urlencode, urlunparse at the top of the file.

        if transcript_urls:
            next_page = page + 1
            # Rebuild the current URL with the page query parameter incremented.
            parsed_url = urlparse(response.url)
            new_query = urlencode({"page": next_page})
            next_url = urlunparse(parsed_url._replace(query=new_query))
            yield scrapy.Request(
                url=next_url,
                callback=self.parse_link_page,
                meta={"ccode": ccode, "page": next_page},
            )
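Since the JSON response also carries a count key, the stop condition described in the question (count = 0) could be checked directly instead of testing whether any URLs were extracted. A minimal variant of the block above, assuming data is the dictionary already parsed in parse_link_page:

        if data.get("count", 0) > 0:
            next_page = page + 1
            # Schedule the next JSON page; the recursion through the callback
            # ends once the endpoint reports count = 0.
            yield scrapy.Request(
                url="https://seekingalpha.com/symbol/" + ccode + "/earnings/more_transcripts?page=" + str(next_page),
                callback=self.parse_link_page,
                meta={"ccode": ccode, "page": next_page},
            )

Both variants terminate the same way: once a page yields no new transcripts, no further request is scheduled.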
