BeautifulSoup extract data from <script> with unicode

I am trying to parse JSON that is embedded in a script tag of an HTML page.

import requests
from bs4 import BeautifulSoup
import json

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36 OPR/64.0.3417.47",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "fr-FR,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Upgrade-Insecure-Requests": "1",
}

proxy = "http://stack:overflow@45.135.149.142:14758"

url = "https://www.seloger.com/list.htm?projects=2%2C5&types=2%2C1&natures=1%2C2%2C4&places=%5B%7Bci%3A60088%7D%5D&enterprise=0&qsVersion=1.0"
r = requests.get(url, proxies={"http": proxy, "https": proxy}, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
for script in soup.find_all('script'):
    if "initialData" in script.text:
        data = script.text.split('JSON.parse("', 1)[1].split('");window["tags"]', 1)[0]
        json_data = json.loads(data)

This error is returned:

json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

The problem is that \u0022 is not converted to a quotation mark ("), which triggers the JSON decoder error.

Also, whenever I print the script's code, the escape sequence \u0022 is printed instead of a quotation mark ("). I have already tried encoding and decoding in multiple formats before passing the text to json.loads, but nothing worked.

The problem only occurs when the code is parsed directly from the request response. I cannot reproduce the issue with a hand-written snippet; this code works as expected:

snippet = '''<script>
window["initialData"] = JSON.parse("{\u0022foo\u0022:\u0022bar\u0022,\u0022xxx\u0022:\u0022xyz\u0022}")
</script>
'''
soup = BeautifulSoup(snippet, 'html.parser')
for script in soup.find_all('script'):
    data = script.text.split('JSON.parse("')[1].split('")')[0]
    json_data = json.loads(data)
    print(json_data)
    # output : {'foo': 'bar', 'xxx': 'xyz'}

How can I fix this?
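A possible reason the test snippet in the question does not reproduce the failure: it is written as a normal Python string literal, so Python itself turns each \u0022 into a real quotation mark before BeautifulSoup or json.loads ever sees it. The page served over HTTP contains the six literal characters \u0022 instead. The sketch below, which uses only the standard library and a hypothetical `extract` helper mirroring the split logic above, contrasts a normal literal with a raw one:

```python
import json

# Normal string literal: Python converts \u0022 to " when the source is compiled.
normal = 'JSON.parse("{\u0022foo\u0022:\u0022bar\u0022}")'
# Raw string literal: the backslash sequences stay in the text, as on the real page.
raw = r'JSON.parse("{\u0022foo\u0022:\u0022bar\u0022}")'


def extract(script_text):
    """Hypothetical helper: return the text between JSON.parse(" and the next ")."""
    return script_text.split('JSON.parse("', 1)[1].split('")', 1)[0]


print(extract(normal))  # already contains real quotation marks
print(extract(raw))     # still contains literal \u0022 sequences

parsed = json.loads(extract(normal))  # parses without error

try:
    json.loads(extract(raw))
    raw_error = None
except json.JSONDecodeError as exc:
    raw_error = str(exc)  # same kind of failure as with the live page

print(raw_error)
```

Declaring the test snippet as a raw string (r'''...''') should therefore make the local test behave like the real response.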

You can do it with a regex (not the best way), but this worked for me:

import requests,json,re
proxy='http://stack:overflow@45.135.149.142:14758'
url='https://www.seloger.com/list.htm?projects=2%2C5&types=2%2C1&natures=1%2C2%2C4&places=%5B%7Bci%3A60088%7D%5D&enterprise=0&qsVersion=1.0'
r = requests.get(url, proxies={"http": proxy, "https": proxy})
json_data = json.loads(json.loads('"' + re.search(r'JSON\.parse\("(.+)"\);w', r.text).group(1) + '"')) # note it needs to be double wrapped
json_data.keys()
# dict_keys(['cards', 'navigation', 'SEO', 'tracking', 'adverts', 'bookmarks', 'failure', 'engine'])
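The double json.loads above can be read as two separate steps. A minimal sketch, assuming the page text looks like the hypothetical `page` string below (the `city` key and its value are made up for illustration): wrapping the captured text in quotation marks makes it a JSON string literal, so the first json.loads resolves the \uXXXX escapes, and the second json.loads parses the resulting JSON text.

```python
import json
import re

# Hypothetical stand-in for r.text; the raw string keeps the backslashes literal.
page = (
    r'<script>window["initialData"] = '
    r'JSON.parse("{\u0022city\u0022:\u0022caf\u00e9\u0022}");'
    r'window["tags"] = [];</script>'
)

# Capture everything between JSON.parse(" and the first ");
literal = re.search(r'JSON\.parse\("(.+?)"\);', page).group(1)

# Step 1: treat the captured text as a JSON string literal to resolve the escapes.
inner = json.loads('"' + literal + '"')

# Step 2: the decoded string is ordinary JSON text.
data = json.loads(inner)

print(inner)
print(data)
```

This relies on the captured text being a valid JSON string body, which should hold when it comes from a JavaScript string literal passed to JSON.parse.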

The content you're trying to scrape includes Unicode escape sequences that appear to be escaped themselves, so they arrive as literal backslash sequences rather than as the characters they stand for.

The solution I've found involves encoding and then decoding the string, although there might be a better way:

data.encode("utf-8").decode("unicode-escape")

I also made some other tweaks to your code, particularly in how the data is parsed, as seen in this snippet, which I used as a demo/test:

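import json
import re

from bs4 import BeautifulSoup

# Capture the string literal passed to JSON.parse(...) for window["initialData"].
json_re = re.compile(r"window\[\"initialData\"] = JSON\.parse\(\"(.*)\"\);window\[\"tags\"]")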

with open("../out/temp.html", 'rb') as file_in:
    soup = BeautifulSoup(file_in.read(), 'lxml')

raw_data = ""
for script in soup.find_all('script'):
    if "initialData" in script.text:
        res_text = script.get_text(strip=True)
        raw_data = json_re.search(res_text).group(1)
        break

print(raw_data)
t_1 = raw_data.encode("utf-8")
print(t_1)

t_2 = t_1.decode("unicode-escape")
print(t_2)

t_3 = json.loads(t_2)
print(t_3)
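One caveat about the encode/decode round trip above: the unicode-escape codec interprets any byte that is not part of an escape sequence as Latin-1, so real non-ASCII characters in the text can come out mangled. Whether this matters depends on whether the page escapes every non-ASCII character. The sketch below uses a hypothetical `raw_data` value (not taken from the real page) to compare the codec with wrapping the text in quotation marks and decoding it as a JSON string:

```python
import json

# Hypothetical extracted text: escaped quotation marks plus one real non-ASCII character.
raw_data = r'{\u0022title\u0022:\u0022Maison à vendre\u0022}'

# Approach from the answer: the UTF-8 bytes of the accented letter are read as Latin-1.
via_codec = raw_data.encode("utf-8").decode("unicode-escape")

# Alternative: decode the text as a JSON string literal, which leaves non-ASCII intact.
via_json = json.loads('"' + raw_data + '"')

print(via_codec)
print(via_json)

print(json.loads(via_codec)["title"])
print(json.loads(via_json)["title"])
```

If the scraped JSON is guaranteed to be pure ASCII with \uXXXX escapes, both approaches should give the same result.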

Maybe the problem is the splitting step, but I have had similar problems with JSON and encodings. Try re-encoding response.text with 'utf-8' or 'latin'. Sometimes that worked for me. Hope it helps!
