How to scrape <script type="text/javascript">
I am trying to figure out how to scrape the contents of a JavaScript tag. I believe a regex might be the easiest way.
The tag looks like:
<script type="text/javascript">
var spConfig=newApex.Config({
"attributes": {
"199": {
"id": "199",
"code": "legend",
"label": "Weapons",
"options": [
{
"label": "10",
"priceInGame": "0",
"id": [
]
},
{
"label": "10.5",
"priceInGame": "0",
"id": [
]
},
{
"label": "11",
"priceInGame": "0",
"id": [
"66659"
]
},
{
"label": "11.5",
"priceInGame": "0",
"id": [
]
},
{
"label": "12",
"priceInGame": "0",
"id": [
]
},
{
"label": "12.5",
"priceInGame": "0",
"id": [
]
},
{
"label": "13",
"priceInGame": "0",
"id": [
]
},
{
"label": "4",
"priceInGame": "0",
"id": [
]
},
{
"label": "4.5",
"priceInGame": "0",
"id": [
]
},
{
"label": "5",
"priceInGame": "0",
"id": [
]
},
{
"label": "5.5",
"priceInGame": "0",
"id": [
]
},
{
"label": "6",
"priceInGame": "0",
"id": [
]
},
{
"label": "6.5",
"priceInGame": "0",
"id": [
]
},
{
"label": "7",
"priceInGame": "0",
"id": [
]
},
{
"label": "7.5",
"priceInGame": "0",
"id": [
]
},
{
"label": "8",
"priceInGame": "0",
"id": [
"66672"
]
},
{
"label": "8.5",
"priceInGame": "0",
"id": [
"66673"
]
},
{
"label": "9",
"priceInGame": "0",
"id": [
]
},
{
"label": "9.5",
"priceInGame": "0",
"id": [
"66675"
]
}
]
}
},
"weaponID": "66733",
"chooseText": "Apex Legends",
"Config": {
"includeCoins": false,
}
});
</script>
and I want to scrape every "label" value.
What I tried is:
for nosto_sku_tag in bs4.find_all('script', {'type': 'text/javascript'}):
    try:
        test = re.findall('var spConfig = (\{.*}?);', nosto_sku_tag.text.strip())
        print(test)
    except:  # noqa
        continue
but it only returned an empty list: []
So my question is: what can I do to scrape the labels?
You need to specify the attribute using the attr=value or attrs={'attr': 'value'} syntax.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments
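Separately from the attribute syntax, the pattern in the question cannot match this script text, for two reasons: it expects `var spConfig = {` with spaces around `=`, while the page has `var spConfig=newApex.Config({`, and `.` does not match newlines unless `re.DOTALL` is set. The sketch below shows the difference on a shortened stand-in string (`script_text` is a hypothetical excerpt, not the full tag):

```python
import re

# Shortened stand-in for the script text from the question (assumption).
script_text = 'var spConfig=newApex.Config({\n"weaponID": "66733"\n});'

# Pattern from the question: expects spaces around "=" and "{" right after it.
original = re.findall(r'var spConfig = (\{.*}?);', script_text)

# Adjusted pattern: matches the real prefix; DOTALL lets "." cross newlines.
adjusted = re.findall(
    r'var spConfig=newApex\.Config\((\{.*\})\);',
    script_text,
    flags=re.DOTALL,
)

print(original)  # []
print(adjusted)
```

With `re.DOTALL` there is no need to strip newlines before matching, although removing them first, as in the code that follows, works as well.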
import re
from ast import literal_eval

from bs4 import BeautifulSoup

if __name__ == '__main__':
    html = '''
<script type="text/javascript">
var spConfig=newApex.Config({
"attributes": {
"199": {
"id": "199",
"code": "legend",
"label": "Weapons",
"options": [
{ "label": "10", "priceInGame": "0", "id": [] },
{ "label": "10.5", "priceInGame": "0", "id": [] },
{ "label": "11", "priceInGame": "0", "id": [ "66659" ] },
{ "label": "7.5", "priceInGame": "0", "id": [] },
{ "label": "8", "priceInGame": "0", "id": ["66672"] }
]
}
},
"weaponID": "66733",
"chooseText": "Apex Legends",
"taxConfig": {
"includeCoins": False,
}
});
</script>
'''
    soup = BeautifulSoup(html, 'html.parser')
    # this one works too
    # script = soup.find('script', attrs={'type':'text/javascript'})
    script = soup.find('script', type='text/javascript')
    # Drop newlines so that ".*" can span the whole object literal.
    js: str = script.text.replace('\n', '')
    # Raw string so that \( and \) reach the regex engine unchanged.
    raw_json = re.search(r'var spConfig=newApex\.Config\(({.*})\);', js).group(1)
    # json.loads() fails on this sample because of the Python-style `False`
    # and the trailing comma; ast.literal_eval accepts both.
    # Note: if the page serves lowercase `false` (as in the question),
    # literal_eval rejects it as well, so it would need converting first.
    data = literal_eval(raw_json)
    labels = [opt['label'] for opt in data['attributes']['199']['options']]
    print(labels)
Output:
['10', '10.5', '11', '7.5', '8'] ... some removed for brevity
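The sample above uses a Python-style `False`, whereas the script in the question has lowercase `false` followed by a trailing comma. One possible way to handle that form is to strip trailing commas and then use `json.loads`. This is only a sketch: `loads_lenient` is a hypothetical helper, and it assumes that no string value in the data contains a comma directly followed by `}` or `]`.

```python
import json
import re


def loads_lenient(raw: str) -> dict:
    """Parse JSON-like text that may contain trailing commas (hypothetical helper)."""
    # Remove a comma that is followed only by whitespace and a closing } or ].
    cleaned = re.sub(r',\s*([}\]])', r'\1', raw)
    return json.loads(cleaned)


# Shortened stand-in for the object captured from the script tag (assumption).
raw = '''{
"attributes": {"199": {"options": [
    {"label": "10", "id": []},
    {"label": "11", "id": ["66659"]}
]}},
"Config": {
"includeCoins": false,
}
}'''

data = loads_lenient(raw)
labels = [opt['label'] for opt in data['attributes']['199']['options']]
print(labels)  # ['10', '11']
```

Because `json.loads` understands `false`, `true` and `null` natively, no keyword replacement is needed once the trailing comma is gone.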
If you are just looking for the whole "label" entry in the JSON object, use the following pattern:
("label":) "([^"]+)",
Then, if you want only the actual value, use \2 to pull back the second group.
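As a rough illustration of that pattern in Python, the second group can be read with `match.group(2)`. The `js` string here is a hypothetical, shortened excerpt of the script body. Note that the pattern also matches the outer "label": "Weapons" entry, so it may need filtering.

```python
import re

# Hypothetical excerpt of the script body, shortened from the question.
js = '''
"label": "Weapons",
"options": [
{ "label": "10", "priceInGame": "0", "id": [] },
{ "label": "10.5", "priceInGame": "0", "id": [] }
]
'''

pattern = re.compile(r'("label":) "([^"]+)",')

# group(1) is the literal key, group(2) is the value between the quotes.
labels = [m.group(2) for m in pattern.finditer(js)]
print(labels)  # ['Weapons', '10', '10.5']
```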