
ValueError: Invalid \escape when reading JSON as a response in Scrapy

While parsing, I get a text response with JSON in it. The responses all look very much alike, and some of them parse without any errors, but others throw the error shown below.

I tried replace('\\r\\n', '') and strict=False, to no avail.

The URL I get the JSON from appears (truncated) in the error log below. Here is my code (line 51 is the data = json.loads(...) call).

Also, when I open this URL in scrapy shell, the response comes up empty and throws another error: no JSON document located. I do not know whether this is important.

def parse_jsn(self, response):
    # inspect_response(response, self)

    # This is the call that raises ValueError for some responses.
    data = json.loads(response.body_as_unicode())
    item = response.meta['item']
    item['text'] = data[0]['bodyfull']
    yield item

Here is the error output.

ValueError: Invalid \escape: line 4 column 942 (char 945)
2017-03-25 17:21:19 [scrapy.core.scraper] ERROR: Spider error processing <GET
or.com/UserReviewController?a=mobile&r=434622632> (referer: https://www.tripa
w-g60763-d122005-Reviews-or490-The_New_Yorker_A_Wyndham_Hotel-New_York_City_N
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in it
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", l
der_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", l
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py",

    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", lin
    return (r for r in result or () if _filter(r))
  File "C:\Code\Active\tripadvisor\tripadvisor\spiders\mtripad.py", line 51,
    data = json.loads(response.body_as_unicode(), strict=False)
  File "c:\python27\lib\json\__init__.py", line 352, in loads
    return cls(encoding=encoding, **kw).decode(s)
  File "c:\python27\lib\json\decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "c:\python27\lib\json\decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 4 column 579 (char 582)

First of all, +1 for scraping the mobile API. That is much more clever than scraping the HTML!

Indeed, there is an issue with the encoding. The response contains some octal-encoded characters ( [...] \\074br/\\076\\074br/\\076Best Regards,\\074br/\\076Emily [...] ) that break the JSON parsing. To get rid of them, use:

response.body.decode('unicode-escape')
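To see why this helps, here is a minimal Python 3 sketch. The sample bytes are an assumption modelled on the fragment quoted above, not the real response from the API.

```python
import json

# Hypothetical sample modelled on the fragment above; not the real API response.
raw = b'[{"bodyfull": "Thanks for your stay.\\074br/\\076Best Regards,\\074br/\\076Emily"}]'

# JSON has no octal escapes, so parsing the raw text fails with a ValueError.
try:
    json.loads(raw.decode('utf-8'))
    parsed_raw = True
except ValueError:
    parsed_raw = False

# 'unicode-escape' turns \074 and \076 into '<' and '>'.
decoded = raw.decode('unicode-escape')
data = json.loads(decoded)
print(data[0]['bodyfull'])
```

One caveat worth checking against real data: 'unicode-escape' also rewrites legitimate JSON escapes such as \" and \n, so a body that contains those may still fail to parse after decoding.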

The data also contains HTML character references, for example "&#x201c;Nice clean and perfectly average&#x201d;". I suggest unescaping them. In Python 2:

from HTMLParser import HTMLParser
...
json.loads(HTMLParser().unescape(response.body.decode('unicode-escape')))
...

In Python 3:

import html 
...
json.loads(html.unescape(response.body.decode('unicode-escape')))
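If decoding the whole body with 'unicode-escape' turns out to be too aggressive, a narrower variant is possible. The sketch below is a different technique from the one-liners above: it rewrites only the three-digit octal escapes into JSON \u escapes, and unescapes HTML entities after parsing so that an entity such as &quot; cannot break the JSON syntax. The helper name load_review_json and the sample bytes are hypothetical, and it assumes the payload is UTF-8 and is a list of flat objects, as data[0]['bodyfull'] in the question suggests.

```python
import html
import json
import re

# A backslash followed by exactly three octal digits, e.g. \074.
# Assumes such sequences are never preceded by another backslash.
OCTAL_ESCAPE = re.compile(r'\\([0-7]{3})')


def load_review_json(body):
    """Hypothetical helper: parse a bytes body that contains octal escapes."""
    text = body.decode('utf-8')  # assumption: the body is UTF-8
    # Rewrite \074 as \u003c, which is a valid JSON escape.
    text = OCTAL_ESCAPE.sub(
        lambda m: '\\u{:04x}'.format(int(m.group(1), 8)), text)
    data = json.loads(text)
    # Unescape HTML entities in string values only, after parsing.
    return [
        {key: html.unescape(value) if isinstance(value, str) else value
         for key, value in entry.items()}
        for entry in data
    ]


# Made-up sample combining both problems described above.
sample = (b'[{"title": "&#x201c;Nice clean and perfectly average&#x201d;", '
          b'"bodyfull": "stay. \\074br/\\076Emily"}]')
reviews = load_review_json(sample)
print(reviews[0]['title'])
print(reviews[0]['bodyfull'])
```

In a spider this would be called as load_review_json(response.body) in place of the json.loads line.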

The result should look like: [{'title': '“Nice clean and perfectly average”', 'bodyfull': '[...] stay. <br/><br/>Best Regards,<br/>Emily Rodriguez", [...]}]

As you can see, there are some HTML tags in the result. If you want to remove them, you could use a regex like:

import re
...
p = re.compile(r'<.*?>')
no_html = p.sub('', str_html)
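As a small runnable illustration of the regex above, using a made-up string in place of the real review text: removing the tags outright can glue words together, so replacing each tag with a space and then collapsing whitespace may read better.

```python
import re

TAG = re.compile(r'<.*?>')

# Made-up sample; in the spider this would be data[0]['bodyfull'].
str_html = 'stay. <br/><br/>Best Regards,<br/>Emily'

# Plain removal, as in the snippet above.
no_html = TAG.sub('', str_html)

# Variant: replace tags with a space, then collapse runs of whitespace.
spaced = ' '.join(TAG.sub(' ', str_html).split())

print(no_html)
print(spaced)
```

A regex like this is fine for simple markup such as <br/>, but it is not a general HTML parser.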
