
How can I scrape JSON data from an HTML page source?

I'm trying to pull some data from an online music database. In particular, I want to pull this data, which you can find in the page source with CTRL+F: "isrc":"GB-FFM-19-0853."

view-source:https://www.audionetwork.com/browse/m/track/purple-beat_1008534

I'm using Python and Selenium and have tried to locate the data by tag, XPath and id, but nothing seems to work.

I haven't seen this x:y format before, and some searching makes me think it's JSON.

Is there a way to grab that isrc data via Selenium? I'd need the approach to be generic (i.e. work for pages with different isrc values, as each music track has a different one).

My code so far:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from time import sleep
import os

# Access AudioNetwork and search for tracks.

# Raw string so the backslashes in the Windows path are not treated as escapes.
path = r"C:\Program Files (x86)\chromedriver.exe"

driver = webdriver.Chrome(path)

driver.get("https://www.audionetwork.com/track/searchkeyword")

search = driver.find_element(By.ID, "js-keyword")
search.send_keys("ANW3175_001_Purple-Beat.wav")
search.send_keys(Keys.RETURN)
sleep(5)

music_link = driver.find_element(By.CSS_SELECTOR, "a.track__title")

music_link.click()

I know I need better waits, and there are probably other issues with the code, but any ideas on how to grab that ISRC number?

Yes, this is JSON format. It's actually JSON wrapped inside an HTML script tag. It's essentially a "key": "value" pair, so the specific thing you outlined ("isrc":"GB-FFM-19-08534") has a key of isrc with a value of GB-FFM-19-08534.
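To make the "JSON wrapped inside a script tag" idea concrete, here is a minimal sketch that pulls the text out of every script element in an HTML string and parses it with the standard library only. The ScriptCollector class and the sample_html string are made up for illustration (the real page is much larger and its script attributes may differ); with Selenium, the HTML string could come from driver.page_source.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies arrive here as plain text.
        if self._in_script:
            self.scripts[-1] += data


# Hypothetical, heavily simplified page source for illustration.
sample_html = (
    "<html><body>"
    '<script type="application/json">'
    '{"props": {"pageProps": {"track": {"isrc": "GB-FFM-19-08534"}}}}'
    "</script>"
    "</body></html>"
)

collector = ScriptCollector()
collector.feed(sample_html)
collector.close()

# The script body is an ordinary JSON document, so json.loads gives a dict.
data = json.loads(collector.scripts[0])
print(data["props"]["pageProps"]["track"]["isrc"])
```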

Python has a built-in library for parsing JSON (the json module); I think you might want this: https://www.w3schools.com/python/gloss_python_json_parse.asp. Let me know if that works for you.

If you wanted to find the value of isrc, you could do:

import json

... # your code here

jsonString = json.loads(someValueHere)
isrcValue = jsonString["isrc"]

Replace someValueHere with the JSON string that you're parsing, and that should help. I think isrc is nested though, so it might not be quite that simple. You can't just do jsonString["track.isrc"] in Python, because a dictionary lookup treats "track.isrc" as one literal key rather than as a path. The path you're looking for is props.pageProps.track.isrc, so you may have to assign a variable per layer:

jsonString = json.loads(someValueHere)
propsValue = jsonString["props"]
pagePropsValue = propsValue["pageProps"]
trackValue = pagePropsValue["track"]
isrcValue = trackValue["isrc"]
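If assigning one variable per layer feels verbose, the same walk can be written as a small helper that follows a dot-separated path. This is only a sketch: get_by_path is a hypothetical helper (not part of the json module), and the sample string is a cut-down stand-in for the real page data.

```python
import json
from functools import reduce


def get_by_path(data, dotted_path):
    """Follow a dot-separated path such as 'props.pageProps.track.isrc'."""
    # Each step indexes one level deeper; a missing key raises KeyError.
    return reduce(lambda node, key: node[key], dotted_path.split("."), data)


# Hypothetical sample mirroring the nesting described above.
sample = '{"props": {"pageProps": {"track": {"isrc": "GB-FFM-19-08534"}}}}'
parsed = json.loads(sample)

isrc_value = get_by_path(parsed, "props.pageProps.track.isrc")
print(isrc_value)
```

Chained indexing, parsed["props"]["pageProps"]["track"]["isrc"], does the same thing without a helper; the function is mainly useful when the path is stored as a string.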

You want to extract the entire script as JSON data (which can be read as a dictionary in Python) and search for the "isrc" parameter.
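If the exact nesting is not known in advance, or differs between pages, one option is to walk the parsed dictionary and collect every value stored under "isrc". The sketch below assumes the data has already been parsed with json.loads; the find_key helper and the sample data (including the placeholder second ISRC) are made up for illustration.

```python
def find_key(node, target):
    """Yield every value stored under `target` anywhere in nested dicts/lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == target:
                yield value
            # Keep descending: the key may also appear deeper down.
            yield from find_key(value, target)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, target)


# Hypothetical parsed data; "ZZ-ZZZ-00-00001" is a placeholder, not a real ISRC.
sample = {
    "props": {
        "pageProps": {
            "track": {"isrc": "GB-FFM-19-08534"},
            "related": [{"isrc": "ZZ-ZZZ-00-00001"}],
        }
    }
}

all_isrcs = list(find_key(sample, "isrc"))
first_isrc = next(find_key(sample, "isrc"), None)
print(all_isrcs)
print(first_isrc)
```

Because a page may contain more than one "isrc" entry, a fixed path is the safer choice when the structure is known; the recursive search is a fallback.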

The following code uses Selenium to extract the script content inside the page, parse it as JSON and print the "isrc" value to the terminal.

from selenium import webdriver
from selenium.webdriver.common.by import By
import json

driver = webdriver.Chrome()
driver.get("https://www.audionetwork.com/browse/m/track/purple-beat_1008534")

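# Grab the first <script> element in <body>, which this code expects to hold the JSON.
search = driver.find_element(By.XPATH, "/html/body/script[1]")
content = search.get_attribute('innerHTML')

# Parse the script text into a Python dictionary.
content_as_dict = json.loads(content)

print(content_as_dict['props']['pageProps']['track']['isrc'])

# quit() closes every window and ends the WebDriver session.
driver.quit()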
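The positional XPath above depends on the JSON sitting in the first script tag. A more generic variant, sketched here under that caveat, tries each script body in turn and returns the ISRC from the first one that parses as JSON and contains the expected path. extract_isrc is a hypothetical helper, and the candidate strings are invented for illustration.

```python
import json


def extract_isrc(script_texts):
    """Return the first ISRC found among candidate <script> bodies, else None."""
    for text in script_texts:
        try:
            data = json.loads(text)
        except (TypeError, ValueError):
            continue  # not JSON (for example, ordinary JavaScript)
        try:
            return data["props"]["pageProps"]["track"]["isrc"]
        except (KeyError, TypeError):
            continue  # valid JSON, but not the payload we are after
    return None


# Hypothetical script bodies, in the order they might appear on a page.
candidates = [
    "window.dataLayer = [];",
    '{"unrelated": true}',
    '{"props": {"pageProps": {"track": {"isrc": "GB-FFM-19-08534"}}}}',
]

print(extract_isrc(candidates))
```

With Selenium, the list could be built from [el.get_attribute("innerHTML") for el in driver.find_elements(By.TAG_NAME, "script")] before calling driver.quit().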
