Scraping content in JSON format - Python

Question

I am trying to scrape pages like this using Python 3.5. I have scraped its content using BeautifulSoup, but I am having trouble getting the number of sizes. On this specific page there are 9 sizes (FR 80 A, FR 80 B, FR 80 C, etc.). I suppose this information is stored in JSON format. I am trying to use the json package, but I can't find the 'start' and 'end'. My code looks like this:

import requests
import json

page = requests.get('https://www.laperla.com/fr/en/cfiplm000566-bgw532.html')
content = page.text    
start = content.find('spConfig') + ...
end = ...    
data = json.loads(content[start:end])
sizes = data['attributes']['179']['options']
print(len(sizes))

The correct output should be 9, since there are 9 sizes. I don't want to use Selenium or similar packages. So, what are the correct 'start' and 'end'? Is there a better way to scrape this data than what I am trying to do?

Answer 1

1. Iterate over all script tags and search for the target JSON.

2. Use a regex to grab the start and end.

3. Parse the result with the json module.

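import json
import re
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.laperla.com/fr/en/cfiplm000566-bgw532.html').text, 'html.parser')
for i in soup.select('script'):  # find the script that calls Product.Config(...)
    if 'Product.Config' in str(i):
        data = re.search(r'(?is)(Product\.Config\()(.*?)(\))', str(i)).group(2)

json_data = json.loads(data)
print(len(json_data['attributes']['179']['options']))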
9
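
Note that the non-greedy regex stops at the first ')' after 'Product.Config(', so it would cut the JSON short if any value contained a closing parenthesis. As a possible alternative, here is a minimal sketch that locates the 'start' with str.find and lets json.JSONDecoder.raw_decode work out the 'end' on its own. The helper name extract_config and the sample string are hypothetical, made up for illustration; the real page's payload is much larger and its exact layout is an assumption.

```python
import json


def extract_config(html, marker="Product.Config("):
    """Return the JSON value passed to `marker` in `html`, or None if absent."""
    start = html.find(marker)
    if start == -1:
        return None
    start += len(marker)
    # raw_decode does not skip leading whitespace, so do it by hand.
    while start < len(html) and html[start].isspace():
        start += 1
    # raw_decode parses exactly one JSON value beginning at `start` and
    # returns (value, end_index), so no explicit 'end' has to be computed.
    data, _end = json.JSONDecoder().raw_decode(html, start)
    return data


# Hypothetical page fragment (NOT the real page content), with a ')'
# inside a label to show a case the non-greedy regex would truncate.
sample = '''<script>
var spConfig = new Product.Config({"attributes": {"179": {"label": "Size",
    "options": [{"label": "FR 80 A"}, {"label": "FR 80 B (US 32)"}]}}});
</script>'''

config = extract_config(sample)
print(len(config["attributes"]["179"]["options"]))
```

With the real page, one would pass page.text (or str(i) for the matching script tag) instead of sample; raw_decode raises json.JSONDecodeError if the text after the marker is not valid JSON.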

Solution 1
1 accepted 2017-10-17 11:44:47
