
Scraping data from an interactive highchart.js graph

I'm mostly a lurker on this platform and usually solve my problems with the answers to existing questions, but I couldn't find one that covers my current problem. I'm trying to scrape data from this website using Scrapy. I can already scrape most of the data I need; however, there are two interactive Highcharts graphs whose data I would also like to extract. (Picture of first graph)

What I tried so far:

  • Extracting the data directly from the HTML response. I can only access the axis values this way, so this approach did not work.
  • Finding the API call with the browser's dev tools, similar to this approach. However, the only XHR visible is called footprint and it does not contain any response. The Initiator tab for footprint shows a request call stack pointing to https://crowdcircus.com/js/app.js?id=6677107ebf6c7824be09, but I don't know whether this helps, since I'm really new to JSON and web scraping.

A hint and/or an explanation of how to scrape the chart data from this website would be much appreciated.

To see the graphs you have to log in here. I've created a throwaway account (email: mivop31962@aranelab.com, password: 12345) so you can see the data.


Update:

Sebastian's answer pointed me in the right direction. I ended up using scrapy-splash, which allows JavaScript to be executed from a Lua script. With the code below I'm able to scrape all the data I needed.

        LUA_SCRIPT = """
            function main(splash)
                 
                 -- Get cookies from previous session
                 splash:init_cookies(splash.args.cookies)
                 assert(splash:go(splash.args.url))
                 assert(splash:wait(0.5))
                 
                 -- Extract data from page
                 -- Read amount of variables in second table
                 table_2_no_series = splash:evaljs('Highcharts.charts[1].series.length')
     
                 -- If the second chart has more than one series, get that data as well
                 if (table_2_no_series==2) or (table_2_no_series==3) then
                    table_2_y1_data = splash:evaljs('Highcharts.charts[1].series[0].yData')
                    table_2_y1_name = splash:evaljs('Highcharts.charts[1].series[0].name')
                 end
                 if (table_2_no_series==3) then
                    table_2_y3_data = splash:evaljs('Highcharts.charts[1].series[2].yData')
                    table_2_y3_name = splash:evaljs('Highcharts.charts[1].series[2].name')  
                 end
                 
                 return {
                          -- Extract website title
                         title = splash:evaljs('document.title'),
                          -- Extract first table data
                         table_1_name = splash:evaljs('Highcharts.charts[0].title.textStr'),
                          -- Extract Timestamps
                         table_1_x = splash:evaljs('Highcharts.charts[0].series[0].xAxis.categories'),
                          -- Extract Finanzierungsstand
                         table_1_y_data = splash:evaljs('Highcharts.charts[0].series[1].yData'),
                         table_1_y_name = splash:evaljs('Highcharts.charts[0].title.textStr'),
         
                         -- Extract second table data
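                         -- Named keys so each value can be looked up by name in the callback
                         table_2_y1_data = table_2_y1_data,
                         table_2_y1_name = table_2_y1_name,
                         table_2_y3_data = table_2_y3_data,
                         table_2_y3_name = table_2_y3_name,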
                         cookies = splash:get_cookies(),
                     }
            end
         """
        SCRAPY_ARGS = {
             'lua_source': LUA_SCRIPT, 
             'cookies' : self.cookies
             }

        # Look for JSON data if we successfully logged in
        yield SplashRequest(url=response.url,
                            callback=self.parse_highchart_data,
                            endpoint='execute', args=SCRAPY_ARGS,
                            session_id="foo")
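
The `parse_highchart_data` callback is not shown above. A minimal sketch of what it might do with the Lua script's return value follows, assuming the decoded result is available as `response.data` (which is where scrapy-splash puts the JSON of an `execute` response) and assuming the key names `table_1_x` and `table_1_y_data` from the script. `as_list` and `rows_from_result` are hypothetical helper names. Handling dict-shaped arrays is only a precaution, on the assumption that some Splash versions may serialise Lua tables as JSON objects keyed "1", "2", ... instead of lists.

```python
def as_list(value):
    """Return a list for either a JSON list or a {"1": ..., "2": ...} object."""
    if value is None:
        return []
    if isinstance(value, dict):
        # Assumption: keys are 1-based integer strings, as in a Lua array.
        return [value[key] for key in sorted(value, key=int)]
    return list(value)


def rows_from_result(data):
    """Pair each x-axis category of the first chart with its y value."""
    xs = as_list(data.get("table_1_x"))
    ys = as_list(data.get("table_1_y_data"))
    return [{"timestamp": x, "value": y} for x, y in zip(xs, ys)]


# Inside the spider, the callback could then look like this
# (assumption: response.data holds the table returned by the Lua script):
#
#     def parse_highchart_data(self, response):
#         for row in rows_from_result(response.data):
#             yield row
```

Keeping the pairing logic in a plain function means it can be checked without running Splash at all.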

Note: The Highcharts API also has a chart method, .getCSV(), which exports the data in CSV format. However, it was not available on this site; either the site blocks it or the export-data module that provides it is not loaded.
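
If `getCSV` is unavailable, an equivalent CSV can be assembled on the Python side from the values the Lua script already returns. This is a minimal sketch using only the standard library; `series_to_csv` is a hypothetical helper, and the column layout (one category column plus one column per series) is an assumption, not the exact format Highcharts would export.

```python
import csv
import io


def series_to_csv(categories, series, x_label="Category"):
    """Build CSV text with one row per category and one column per series.

    `series` maps a series name to its list of y values; missing or
    null values are written as empty cells.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([x_label, *series.keys()])
    for i, category in enumerate(categories):
        row = [category]
        for values in series.values():
            value = values[i] if i < len(values) else None
            row.append("" if value is None else value)
        writer.writerow(row)
    return buffer.getvalue()
```

For example, `series_to_csv(table_1_x, {table_1_y_name: table_1_y_data})` would produce one line per timestamp, with the variable names standing for the corresponding fields of the scraped result.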

It's not exactly a scraping/fetching approach, but on a page that uses Highcharts you can inspect the whole chart configuration from the browser's web console. Try:

console.log(Highcharts.charts), which shows the array of charts rendered on the page. Next, navigate to a particular chart -> series -> data, for example:

console.log(Highcharts.charts[0].series[1].data)
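
The `data` array holds Highcharts point objects, not plain numbers, so for automated extraction it usually helps to reduce each point to its category and y value before serialising. The sketch below builds such a JavaScript expression in Python (for example to pass to `splash:evaljs` or another browser automation tool) and decodes its JSON result. `build_series_js` and `parse_series_json` are hypothetical helpers, and the point properties `category` and `y` are assumed to be populated for the chart in question.

```python
import json


def build_series_js(chart_index=0):
    """Return a JS expression that serialises every series of one chart."""
    return (
        "JSON.stringify("
        f"Highcharts.charts[{int(chart_index)}].series.map(function (s) {{"
        " return {name: s.name,"
        " points: s.data.map(function (p) { return [p.category, p.y]; })};"
        " }))"
    )


def parse_series_json(payload):
    """Decode the JSON string produced by the expression above.

    Returns a dict mapping each series name to a list of (category, y) pairs.
    """
    return {
        entry["name"]: [(x, y) for x, y in entry["points"]]
        for entry in json.loads(payload)
    }
```

Returning a JSON string rather than nested objects keeps the hand-off between the browser and Python to a single, easily logged value.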
