How to scrape data from Highcharts using Python

I am trying to scrape data from the chart at https://www.transfermarkt.com/neymar/marktwertverlauf/spieler/68290. I tried to access the values shown in the boxes through their XPath expressions, but that does not seem to work.

I tried using Scrapy:

date = response.xpath('//*[@id="highcharts-0"]/div/span/b[1]').get()
market_value = response.xpath('//*[@id="highcharts-0"]/div/span/b[2]').get()
club = response.xpath('//*[@id="highcharts-0"]/div/span/b[3]').get()
age = response.xpath('//*[@id="highcharts-0"]/div/span/b[4]').get()

How can I scrape all the data from the chart using Scrapy or Selenium?

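The chart data is not in the HTML elements you are targeting. It is embedded in an inline script in the HTML body, and the browser renders the chart from it on the client side. Scrapy does not execute JavaScript, so those XPath expressions match nothing.

If you want to use Scrapy, you need a regex to pull the data out of the script.

E.g. (not tested):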

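import json
import re

def parse(self, response):  # Scrapy spider callback
    # response.text is the decoded HTML (response.body is bytes, not a method)
    body = response.text
    data = re.findall(r"(?<='series':).*?\}\}\]\}\]", body)

    if not data:
        return None

    # json.loads only succeeds if the matched text is valid JSON
    return json.loads(data[0])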
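
To try the extraction idea without hitting the site, here is a minimal sketch that runs on a made-up HTML string. The sample markup, the `extract_series` helper and the field names are assumptions for illustration only. It also swaps in a slightly different technique: the regex only locates the `'series':` key, and `json.JSONDecoder.raw_decode` then parses the array that follows, so no hand-written terminator pattern is needed. The real page's script may be formatted differently (for example with JavaScript escapes that the JSON parser rejects), in which case this returns None.

```python
import json
import re

# Made-up page fragment: a Highcharts config embedded in an inline script.
SAMPLE_HTML = """
<script>
var chart = new Highcharts.Chart({'chart': {'type': 'line'},
'series': [{"name": "Market value", "data": [{"x": 1, "y": 10}, {"x": 2, "y": 25}]}],
'credits': {'enabled': false}});
</script>
"""


def extract_series(html):
    """Return the parsed 'series' array found in html, or None if absent."""
    start = re.search(r"'series'\s*:\s*", html)
    if start is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever text follows it.
    try:
        series, _ = json.JSONDecoder().raw_decode(html, start.end())
    except json.JSONDecodeError:
        return None
    return series


series = extract_series(SAMPLE_HTML)
print(series[0]["data"])
```

In a Scrapy callback the same helper could be called as `extract_series(response.text)`.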
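
With Selenium, you can read the data directly from the Highcharts object in the page:

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

url = "https://www.transfermarkt.com/neymar/marktwertverlauf/spieler/68290"

chrome_options = Options()
chrome_options.add_argument("--headless")

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                          options=chrome_options)
try:
    driver.get(url)
    time.sleep(5)  # crude wait for the chart to render

    # Read the point data of the first series of the first chart on the page
    data = driver.execute_script('return window.Highcharts.charts[0]'
                                 '.series[0].options.data')
    print(data)
finally:
    driver.quit()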
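
Selenium hands the result of the script back as ordinary Python lists and dicts. As a rough sketch of what to do with it next, the snippet below turns such points into (date, value) rows. It assumes each point is a dict with an `x` key holding a millisecond epoch timestamp and a `y` key holding the value, which is the usual Highcharts convention for datetime axes; the actual page may use other or additional keys. The sample values and the `points_to_rows` helper are invented for illustration.

```python
from datetime import datetime, timezone

# Invented sample in the shape assumed above; not real data from the page.
sample_points = [
    {"x": 1262304000000, "y": 1000000},
    {"x": 1293840000000, "y": 2500000},
]


def points_to_rows(points):
    """Convert Highcharts points (x in epoch ms) to (ISO date, value) tuples."""
    rows = []
    for point in points:
        # Highcharts timestamps are in milliseconds; datetime expects seconds.
        when = datetime.fromtimestamp(point["x"] / 1000, tz=timezone.utc)
        rows.append((when.date().isoformat(), point["y"]))
    return rows


for row in points_to_rows(sample_points):
    print(row)
```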
