Convert HTML to JSON using PUP / JQ & extract data to a variable
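I have data inside HTML that I am trying to extract matches from. I am using bash for this, and since bash cannot do it on its own, I run the HTML through PUP (as recommended on StackOverflow here) and use PUP to extract some patterns. That leaves me with a large JSON document containing data I don't need, so I then run sed commands to remove the unwanted lines. I am trying to find a way to use JQ to select only the data I need, so that I don't have to run the SED commands at all.
So I run the command: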
cat test.html | pup 'div.scene json{}' > out.json
This generates the following.
[
{
"children": [
{
"children": [
{
"class": "icon-new active",
"tag": "div"
},
{
"children": [
{
"children": [
{
"alt": "Album Title - Artist Name - 1",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"alt": "Album Title - Artist Name - 2",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"alt": "Album Title - Artist Name - 3",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"alt": "Album Title - Artist Name - 4",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"alt": "Album Title - Artist Name - 5",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"tag": "span"
},
{
"tag": "span"
},
{
"tag": "span"
},
{
"tag": "span"
},
{
"class": "last",
"tag": "span"
}
],
"class": "sample-picker clearfix",
"data-trackid": "bhangra-tracking-id",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"title": "Album Title"
}
],
"class": "card-overlay",
"tag": "div"
},
{
"children": [
{
"alt": "Album Title",
"class": "lazy card-main-img",
"data-src": "",
"tag": "img",
"title": "Album Title"
}
],
"data-trackid": "bhangra-tracking-id ",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"title": "Album Title"
}
],
"class": "card-image",
"tag": "div"
},
{
"children": [
{
"children": [
{
"data-trackid": "scene-card-info-title Album Title ",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"text": "Album Title",
"title": "Album Title"
}
],
"class": "scene-card-title",
"tag": "div"
},
{
"children": [
{
"data-trackid": "scene-card-model name Artist Name modelid=1111 ",
"href": "/bhangra/profile/view/2842847/artist-name/",
"tag": "a",
"text": "Artist Name",
"title": "Artist Name"
}
],
"class": "model-names",
"tag": "div"
},
{
"tag": "time",
"text": "September 08, 2018"
},
{
"children": [
{
"children": [
{
"class": "label-left-box",
"tag": "span",
"text": "Website Name"
},
{
"class": "label-text",
"tag": "span",
"text": "Website URL"
}
],
"class": "collection label-small",
"data-trackid": "scene-card-collection",
"href": "/bhangra/main/id/url/",
"tag": "a",
"title": "Website URL"
},
{
"class": "label-hd ",
"tag": "span"
},
{
"children": [
{
"children": [
{
"class": "icons like-icon",
"tag": "span"
},
{
"class": "like-amount",
"tag": "var",
"text": "0"
}
],
"class": "likes",
"tag": "span"
},
{
"children": [
{
"class": "icons dislike-icon",
"tag": "span"
},
{
"class": "dislike-amount",
"tag": "var",
"text": "0"
}
],
"class": "dislikes",
"tag": "span"
}
],
"class": "label-rating",
"tag": "span"
}
],
"class": "bhangra-information",
"tag": "div"
}
],
"class": "scene-card-info",
"tag": "div"
}
],
"class": "bhangra-card scene ",
"tag": "div"
}
]
I then use JQ to return some of the details I want.
cat out.json | jq '.[] | {"1": .children[1].children[0].children, "2": .children[1].children[1].children, "date": .children[1].children[2].text}'
This returns the following.
{
"1": [
{
"data-trackid": "scene-card-info-title Album Title ",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"text": "Album Title",
"title": "Album Title"
}
],
"2": [
{
"data-trackid": "scene-card-model name Artist Name modelid=1111 ",
"href": "/bhangra/profile/view/2842847/artist-name/",
"tag": "a",
"text": "Artist Name",
"title": "Artist Name"
}
],
"date": "September 08, 2018"
}
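With the above, the next album (Album2) also gets the keys "1" and "2" followed by "date". The combined output is therefore not a single valid JSON document, and I can't target the data I want because the keys are the same for every album.
To get around this, I run a bunch of sed commands to remove the unwanted lines above.
Below is what I would like the initial jq query to return, but I'm not sure how to return this specific data.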
{
"1" : {
"album": "Album Title",
"href": "/bhangra/album/view/2842847/title-of-album/",
"artist": "Artist Name",
"date": "September 08, 2018"
},
"2" : {
"album": "Album1 Title",
"href": "/bhangra/album/view/2842847/title-of-album/",
"artist": "Artist1 Name",
"date": "September 08, 2018"
},
"3" : {
"album": "Album2 Title",
"href": "/bhangra/album/view/2842847/title-of-album/",
"artist": "Artist2 Name",
"date": "September 09, 2018"
}
}
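Update (edit 11/09/2018)
I have made a little progress on this: with the query below I managed to pull back the data I need, but the results still come back as separate objects.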
cat out.json | jq '.[] | .children[1].children[0].children[], .children[1].children[1].children[], .children[1].children[2] | {WTF: .title, href, text}'
This outputs the following, which gets me closer to what I want (the previous example).
{
"WTF": "Album Title",
"href": "/bhangra/album/view/2842847/title-of-album/",
"text": "Album Title"
}
{
"WTF": "Artist Name",
"href": "/bhangra/profile/view/2842847/artist-name/",
"text": "Artist Name"
}
{
"WTF": "Null",
"href": "Null",
"text": "September 08, 2018"
}
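The connection between the input JSON and the JSON that is said to be the desired output seems tenuous, but one way to tackle the problem of labeling objects with sequentially numbered keys is to use the following function: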
def tag(s):
reduce s as $x ({n:0, o:{}} ;
.n += 1
| .o += { (.n|tostring): $x})
| .o;
Here, s should be a stream of JSON entities, and the result is a single object with the keys "1", "2", and so on.
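As a quick check of how `tag` numbers its stream, here is a minimal sketch that feeds it three hand-written values instead of the real data. The values and the `result` shell variable are made up for illustration; `-n` runs jq without input and `-c` prints compact output.

```shell
# Demo of tag/1 on a made-up stream of three values (not the pup output).
result=$(jq -n -c '
  def tag(s):
    reduce s as $x ({n: 0, o: {}};
      .n += 1                              # bump the counter
      | .o += { (.n | tostring): $x })     # store $x under "1", "2", ...
    | .o;
  tag("a", "b", {"c": 3})
')
echo "$result"
```

The commas inside `tag(...)` build one stream of three values; jq separates function arguments with `;`, so this is still a single argument.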
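So the task now is to generate the stream of desired objects. Since it's unclear exactly what you want, the following should be regarded as illustrative; the filter is followed by the output it produces for the input shown in the question.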
{date: first(.. | objects | select(.tag == "time" and has("text")) | .text)} as $date
| tag(..
| objects
| select(has("title") and (has("children")|not) and .title == "Album Title")
+ $date )
{
"1": {
"alt": "Album Title - Artist Name - 1",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"2": {
"alt": "Album Title - Artist Name - 2",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"3": {
"alt": "Album Title - Artist Name - 3",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"4": {
"alt": "Album Title - Artist Name - 4",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"5": {
"alt": "Album Title - Artist Name - 5",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"6": {
"alt": "Album Title",
"class": "lazy card-main-img",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"7": {
"data-trackid": "scene-card-info-title Album Title ",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"text": "Album Title",
"title": "Album Title",
"date": "September 08, 2018"
}
}
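If the goal is the album/href/artist/date layout shown in the question, the same `tag` function can be fed one small object per card. The following is only a sketch under assumptions: the `sample` JSON is a trimmed, hand-written stand-in for pup's output (two cards, info subtree only), `first_match` is a hypothetical helper defined here, and it is a guess that the class names `scene-card-title` and `model-names` and the `time` tag identify the right nodes on every card. It also assumes jq 1.5 or later for `first(f)`.

```shell
# Hand-written stand-in for out.json: two cards, reduced to the fields used below.
sample='[
  {"class": "bhangra-card scene ", "tag": "div", "children": [
    {"class": "scene-card-info", "tag": "div", "children": [
      {"class": "scene-card-title", "tag": "div", "children": [
        {"tag": "a", "href": "/bhangra/album/view/2842847/title-of-album/", "text": "Album Title"}]},
      {"class": "model-names", "tag": "div", "children": [
        {"tag": "a", "text": "Artist Name"}]},
      {"tag": "time", "text": "September 08, 2018"}]}]},
  {"class": "bhangra-card scene ", "tag": "div", "children": [
    {"class": "scene-card-info", "tag": "div", "children": [
      {"class": "scene-card-title", "tag": "div", "children": [
        {"tag": "a", "href": "/bhangra/album/view/2842847/title-of-album/", "text": "Album2 Title"}]},
      {"class": "model-names", "tag": "div", "children": [
        {"tag": "a", "text": "Artist2 Name"}]},
      {"tag": "time", "text": "September 09, 2018"}]}]}
]'

result=$(printf '%s' "$sample" | jq -c '
  def tag(s):
    reduce s as $x ({n: 0, o: {}};
      .n += 1 | .o += { (.n | tostring): $x })
    | .o;
  # Hypothetical helper: first object anywhere below "." that satisfies f.
  def first_match(f): first(.. | objects | select(f));
  tag(
    .[]                                     # one iteration per card
    | (first_match(.class == "scene-card-title") | .children[0]) as $title
    | { album:  $title.text,
        href:   $title.href,
        artist: (first_match(.class == "model-names") | .children[0].text),
        date:   (first_match(.tag == "time") | .text) }
  )
')
echo "$result"
```

Selecting by class rather than by position (`.children[1].children[0]`) is a design choice that should tolerate small layout changes. A card with no matching node makes `first_match` produce nothing, so that card is silently skipped rather than emitted with nulls.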