Convert a string from scraped JavaScript into a Python dictionary

Question

I want to scrape the product description from the product page linked below.

I first tried selenium, but the website protects this information, so what selenium returns is the same as what requests returns. To make the script run faster, I scrape the page with requests instead.

Below is the code:

import requests
from bs4 import BeautifulSoup as BS

res = requests.get("https://www.real.de/product/345246038/")
soup = BS(res.text, 'lxml')
code = soup.prettify()
# The product data sits in embedded JavaScript; cut out the part after "attributes:".
split = code.split("attributes:")
for value in split:
    after = value.split(",condition$:b")
    for value in after:
        if "{default:[{name:" in value:
            # Drop the JavaScript-only tail so the rest looks closer to JSON.
            clean = value.replace(",highlighted:void 0}}", "}").replace(": None", "")

Here is the string in the variable clean:

I then convert clean into a dictionary:

import yaml
# safe_load avoids the Loader argument that newer PyYAML versions require for yaml.load
d = yaml.safe_load(clean)

But the result is not a proper dictionary, because not all of the words in the string are enclosed in double quotes ("").

So I use a regex to extract only the words in the string that are not in double quotes. Here is the code:
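The quoting problem can be reproduced with the standard json module on a made-up snippet. This is only a sketch: the strings `bare` and `quoted` are assumptions for illustration, not data from the site. JSON requires double-quoted keys, so the bare-key form is rejected while the quoted form parses.

```python
import json

bare = '{name:"Color"}'       # JavaScript object literal style, key not quoted
quoted = '{"name":"Color"}'   # valid JSON

try:
    json.loads(bare)
    parsed_bare = True
except json.JSONDecodeError as exc:
    # The parser stops at the first unquoted key.
    parsed_bare = False
    print("bare keys rejected:", exc.msg)

print(json.loads(quoted))
```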

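import re

# Find bare words that follow {, comma or colon and are followed by colon or }.
r = re.compile(r'[{,:][a-zA-Z]+[:}]')
string = r.findall(clean)
ta = []
for w in string:
    m = re.search('[a-zA-Z]+', w)
    if m:
        # Wrap the bare word in double quotes.
        new = '"' + m.group(0) + '"'
        ta.append(new)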

However, I don't know how to put these quoted words back into the clean variable.

Can you help me?

Answer 1

You can try (?!"), a negative lookahead that matches only where the next character is not a double quote.

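if "{default:[{name:" in value:
    clean = value.replace(",highlighted:void 0}}","}").replace(": None","")
    # add the lines below (they need `import re` and `import json` at the top of the script)
    # quote bare keys: a word after { or , that is followed by :
    clean = re.sub(r'(\{|,)(?!")(\w+?):', r'\1"\2":', clean)
    # quote bare values: a word after : that is followed by } or ,
    clean = re.sub(r':(?!")(\w+?)(\}|,)', r':"\1"\2', clean)
    jsonData = json.loads(clean)
    print(json.dumps(jsonData, indent=2))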
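To see what the two substitutions do without hitting the website, here is a minimal, self-contained sketch run on a made-up string. The `sample` value and the helper name `quote_js_object` are assumptions for illustration; the real page data is not reproduced here. The first pass quotes bare keys and the second quotes bare values.

```python
import json
import re


def quote_js_object(text):
    """Quote bare keys and bare values so a JS-style object literal parses as JSON."""
    # Bare key: follows '{' or ',', does not start with a quote, ends at ':'.
    text = re.sub(r'(\{|,)(?!")(\w+?):', r'\1"\2":', text)
    # Bare value: follows ':', does not start with a quote, ends at '}' or ','.
    text = re.sub(r':(?!")(\w+?)(\}|,)', r':"\1"\2', text)
    return text


# Hypothetical input, shaped like the scraped snippet but not taken from the site.
sample = '{default:[{name:"Color",value:"Black"},{name:"Weight",value:5}]}'

data = json.loads(quote_js_object(sample))
print(json.dumps(data, indent=2))
```

Two things to keep in mind with this approach: bare numbers come out as strings (the 5 above becomes "5"), and a quoted value that itself contains something like `,word:` would also be rewritten, so it is worth checking the output on the real data.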

Solution 1
1 accepted 2021-03-24 15:20:12
