
JSON encoded as UTF-8 characters: how do I process it as JSON with Python Requests?

I am scraping a website that renders a JavaScript/JSON object that looks like this:

{ "company": "\r\n            \x3cdiv class=\"page-heading\"\x3e\x3ch1\x3eSEARCH
 RESULTS 1 - 40 OF 200\x3c/h1\x3e\x3c/div\x3e\r\n\r\n             
\x3cdiv class=\"right-content-list\"\x3e\r\n\r\n                
\x3cdiv class=\"top-buttons-adm-lft\"\x3e\r\n   

I am attempting to process this as a JSON Object (which is what this looks like) using Python's Requests library.

I have used the following methods to encode/process the text:

unicodedata.normalize("NFKD", get_city_json.text).encode('utf-8', 'ignore')
unicodedata.normalize("NFKD", get_city_json.text).encode('ascii', 'ignore')
unicode(get_city_json.text)

However, even after repeated attempts, the text still contains the escape sequences (such as \x3c) instead of the decoded characters. The Content-Type returned by the web app is "text/javascript; charset=utf-8".

I want to be able to process it as a regular JSON/JavaScript Object for parsing and reading.

Help would be greatly appreciated!

That isn't a UTF-8 problem. The value is escaped HTML markup.

You can decode it using the following:

Python 2

import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(json_value)
print unescaped

Python 3

import html
# HTMLParser.unescape() was removed in Python 3.9; html.unescape() replaces it
unescaped = html.unescape(json_value)
print(unescaped)

Once the string is unescaped, you should get:

<div class="page-heading"><h1>SEARCH RESULTS 1 - 40 OF 200</h1></div>
<div class="right-content-list">
<div class="top-buttons-adm-lft">
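
One caveat: html.unescape and HTMLParser.unescape decode HTML entity references such as &lt;. If the response body literally contains sequences like \x3c, as the sample in the question appears to, those are JavaScript hex escapes. JSON does not allow them, so json.loads rejects the text. A minimal sketch, assuming that is the situation, rewrites each \xNN as the equivalent \u00NN before parsing. The names parse_js_object and sample are hypothetical and not part of the original post.

```python
import json
import re

# A \xNN escape preceded by an even number of backslashes, so the backslash
# that starts it is not itself escaped. Group 1 keeps those backslash pairs.
_HEX_ESCAPE = re.compile(r'(?<!\\)((?:\\\\)*)\\x([0-9a-fA-F]{2})')


def parse_js_object(text):
    """Parse JSON-like text that uses JavaScript hex escapes (hypothetical helper)."""
    # \xNN is not valid JSON, but \u00NN names the same character and is valid.
    fixed = _HEX_ESCAPE.sub(lambda m: m.group(1) + "\\u00" + m.group(2), text)
    return json.loads(fixed)


# Shortened, well-formed version of the sample from the question (assumed shape).
sample = r'{ "company": "\r\n \x3cdiv class=\"page-heading\"\x3e\x3ch1\x3eSEARCH RESULTS 1 - 40 OF 200\x3c/h1\x3e\x3c/div\x3e" }'

data = parse_js_object(sample)
print(data["company"].strip())
```

With Requests, this would be called as parse_js_object(get_city_json.text) in place of get_city_json.json().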


 