
How can I scrape JSON data from an HTML page source?

I'm trying to pull some data from an online music database. In particular, I want the value you can find in the page source with CTRL+F: "isrc":"GB-FFM-19-08534".

view-source:https://www.audionetwork.com/browse/m/track/purple-beat_1008534

I'm using Python and Selenium and have tried to locate the data via things like tag, xpath and id, but nothing seems to be working.

I haven't seen this x:y format before and some searching makes me think it's in a JSON format.

Is there a way to grab that isrc data via Selenium? I'd need the approach to be generic (i.e. work for pages with different isrc values, as each music track has a different one).

My code so far ...

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from time import sleep
import os

# Access AudioNetwork and search for tracks.

# Raw string so the backslashes in the Windows path are not treated as escape sequences.
path = r"C:\Program Files (x86)\chromedriver.exe"

driver = webdriver.Chrome(path)

driver.get("https://www.audionetwork.com/track/searchkeyword")

search = driver.find_element(By.ID, "js-keyword")
search.send_keys("ANW3175_001_Purple-Beat.wav")
search.send_keys(Keys.RETURN)
sleep(5)

music_link = driver.find_element(By.CSS_SELECTOR, "a.track__title")

music_link.click()

I know I need to make better waits / probably other issues with the code, but any ideas on how to grab that ISRC number?

Yes, this is JSON format. It's actually JSON wrapped inside an HTML script tag. It's essentially a "key": "value" pair, so the specific thing you outlined ("isrc":"GB-FFM-19-08534") has a key of isrc with a value of GB-FFM-19-08534.

Python has a built-in json module for parsing JSON; I think you might want this: https://www.w3schools.com/python/gloss_python_json_parse.asp . Let me know if that works for you.

If you wanted to find the value of isrc, you could do:

import json

... # your code here

jsonString = json.loads(someValueHere)
isrcValue = jsonString["isrc"]

Replace someValueHere with the JSON string that you're parsing and that should help. Note that json.loads returns a Python dict, so despite its name jsonString is a dictionary rather than a string. isrc is nested, though, so it isn't quite that simple. Dict keys are matched literally, so jsonString["track.isrc"] would look for a single key named "track.isrc" and raise a KeyError. The path you're looking for is props.pageProps.track.isrc, so index one layer at a time, for example with a variable per layer:

jsonString = json.loads(someValueHere)
propsValue = jsonString["props"]
pagePropsValue = propsValue["pageProps"]
trackValue = pagePropsValue["track"]
isrcValue = trackValue["isrc"]
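
The per-layer lookups above can also be folded into one small helper. This is only a minimal sketch: get_path is a hypothetical helper name, not part of the json module, and the sample string is made-up data standing in for the real script contents. It assumes every layer along the path is a dict.

```python
import json
from functools import reduce


def get_path(data, dotted_path):
    """Walk a nested dict using a dot-separated path such as 'a.b.c'."""
    # Each step indexes the current node with the next key in the path.
    return reduce(lambda node, key: node[key], dotted_path.split("."), data)


# Made-up stand-in for the JSON text found in the page's script tag.
sample = '{"props": {"pageProps": {"track": {"isrc": "GB-FFM-19-08534"}}}}'

page_data = json.loads(sample)
isrc_value = get_path(page_data, "props.pageProps.track.isrc")
print(isrc_value)
```

A missing key anywhere along the path still raises a KeyError, the same as the chained lookups would.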

You want to extract the entire script content as JSON data (which Python reads as a dictionary) and look up the "isrc" key.

The following code uses Selenium to extract the script content from the page, parse it as JSON and print the "isrc" value to the terminal.

from selenium import webdriver
from selenium.webdriver.common.by import By
import json

driver = webdriver.Chrome()
driver.get("https://www.audionetwork.com/browse/m/track/purple-beat_1008534")

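# Grab the first <script> directly under <body>; on this page it holds the JSON data.
search = driver.find_element(By.XPATH, "/html/body/script[1]")
content = search.get_attribute('innerHTML')

# Parse the script text into a Python dict.
content_as_dict = json.loads(content)

print(content_as_dict['props']['pageProps']['track']['isrc'])

# quit() closes every window and ends the WebDriver session, so a separate close() is not needed.
driver.quit()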
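
The fixed XPath above depends on the JSON sitting in the first script tag under body. If that position changes, a more generic option is to scan every script in the page source and search the decoded JSON for the key. The following is a rough sketch using only the standard library. ScriptCollector, find_key and extract_isrc are hypothetical names, and sample_html is made-up markup rather than the real page.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def find_key(node, key):
    """Return the first non-null value stored under `key` in nested JSON, else None."""
    if isinstance(node, dict):
        if node.get(key) is not None:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def extract_isrc(html):
    """Try each script body as JSON and return the first 'isrc' value found."""
    collector = ScriptCollector()
    collector.feed(html)
    for body in collector.scripts:
        try:
            data = json.loads(body)
        except json.JSONDecodeError:
            continue  # ordinary JavaScript or an empty tag, not a JSON payload
        value = find_key(data, "isrc")
        if value is not None:
            return value
    return None


# Made-up markup standing in for the real page source.
sample_html = (
    "<html><body><script>var x = 1;</script>"
    '<script type="application/json">'
    '{"props": {"pageProps": {"track": {"isrc": "GB-FFM-19-08534"}}}}'
    "</script></body></html>"
)
print(extract_isrc(sample_html))
```

With Selenium, the same function could be called as extract_isrc(driver.page_source) once the track page has loaded.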
