
Converting a string from scraped JavaScript into a Python dictionary

I want to scrape the product description from the product page linked below.


I tried scraping with selenium, but the site protects this information, so what I get through selenium is the same as what I get with requests. To make the script run faster, I scrape the page with requests instead.

Below is the code:

import requests
from bs4 import BeautifulSoup as BS

res = requests.get("https://www.real.de/product/345246038/")
soup = BS(res.text, 'lxml')
code = soup.prettify()

# Split the page source around the "attributes:" marker.
for chunk in code.split("attributes:"):
    for value in chunk.split(",condition$:b"):
        if "{default:[{name:" in value:
            # Strip the trailing ",highlighted:void 0}}" fragment and any ": None" text.
            clean = value.replace(",highlighted:void 0}}", "}").replace(": None", "")

Here is the string in the variable clean:

[image: the string stored in clean]

I then try to convert clean into a dictionary:

import yaml
# safe_load is used because newer PyYAML versions require a Loader argument for yaml.load
d = yaml.safe_load(clean)

But the result is not a proper dictionary, because not all of the words in the string are enclosed in double quotes ("").

So I use a regex to extract only the words in the string that are not in double quotes. Here is the code:

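import re

# Find bare words preceded by "{", "," or ":" and followed by ":" or "}".
r = re.compile(r'[{,:][a-zA-Z]+[:}]', flags=re.I)
string = r.findall(clean)
ta = []
for w in string:
    m = re.search('[a-zA-Z]+', w)
    if m:
        # Wrap the bare word in double quotes.
        new = '"' + m.group(0) + '"'
        ta.append(new)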

However, I don't know how to put these words, now wrapped in double quotes (""), back into the clean variable.

Can you help me?

You can try (?!"), a negative lookahead that only matches when the next character is not a double quote.

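# requires "import re" and "import json" at the top of the script
if "{default:[{name:" in value:
    clean = value.replace(",highlighted:void 0}}","}").replace(": None","")
    # add the lines below
    # quote bare keys: a word after "{" or "," that does not start with a quote
    clean = re.sub(r'(\{|,)(?!")(\w+?):', r'\1"\2":', clean)
    # quote bare values: a word after ":" that does not start with a quote
    clean = re.sub(r':(?!")(\w+?)(\}|,)', r':"\1"\2', clean)
    jsonData = json.loads(clean)
    print(json.dumps(jsonData, indent=2))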
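
To see what the two substitutions do without fetching the page, here is a minimal standalone sketch. The sample string is made up and much shorter than the real data, and the helper name quote_bare_tokens is only for illustration. It assumes keys and unquoted values consist solely of word characters (letters, digits, underscore).

```python
import json
import re


def quote_bare_tokens(js_object: str) -> str:
    """Wrap unquoted keys and unquoted simple values in double quotes."""
    # Keys: a word after "{" or "," that does not start with a double quote.
    quoted_keys = re.sub(r'(\{|,)(?!")(\w+?):', r'\1"\2":', js_object)
    # Values: a word after ":" that does not start with a double quote.
    return re.sub(r':(?!")(\w+?)(\}|,)', r':"\1"\2', quoted_keys)


# Hypothetical sample input, not the actual string from the page.
sample = '{default:[{name:"Color",value:"Red"},{name:"Weight",value:5}]}'

data = json.loads(quote_bare_tokens(sample))
print(json.dumps(data, indent=2))
```

Note that the second substitution also quotes bare numbers, so the 5 above comes back as the string "5". Text inside already quoted strings that happens to look like ,word: would be rewritten as well, so check the output against the real data.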
