I am using Scrapy to crawl a website. Some pages use AJAX, so I captured the AJAX requests to get the actual data. So far so good. The responses to those AJAX requests are JSON. Now I would like to parse the JSON, but Scrapy only provides HtmlXPathSelector. Has anybody successfully transformed JSON output into HTML and parsed it with HtmlXPathSelector?
Thank you very much in advance.
import json
response = json.loads(jsonResponse)
The code above decodes the JSON you receive into ordinary Python objects (dicts, lists, strings, numbers). Afterwards, you can process it any way you want.
(Replace jsonResponse with the JSON text that you get from the AJAX request.)
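To make "process it any way you want" concrete, here is a minimal sketch of walking the decoded result. The JSON body and its field names (count, items, title) are made up for illustration; your AJAX response will have its own structure.

```python
import json

# Hypothetical response body, standing in for what an AJAX endpoint might return.
json_response = (
    '{"count": 2, "items": ['
    '{"id": 1, "title": "first"}, {"id": 2, "title": "second"}]}'
)

# A JSON object becomes a dict, a JSON array becomes a list.
data = json.loads(json_response)

# Plain indexing and comprehensions replace XPath for this kind of data.
titles = [item["title"] for item in data["items"]]
print(data["count"], titles)
```

In a Scrapy callback the same json.loads call would be applied to the response body instead of a literal string.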
Slightly complicated, still works.
If you're interested in working with XPaths on JSON output, this is one way to do it.
Disclaimer: this may not be the optimal solution. +1 if someone improves this approach.
- Install the dicttoxml package (pip recommended)
- Download the output using Scrapy's usual Request class
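in spider:
import json
import dicttoxml
from scrapy.http import Request
from scrapy.selector import XmlXPathSelector

    # inside a callback of your spider class
    request = Request(link, callback=self.parse_resp)
    yield request

def parse_resp(self, response):
    # Keep the raw body under its own name so it does not shadow the json module
    raw_json = response.body
    # Now load the contents using Python's json module
    json_dict = json.loads(raw_json)
    # Transform the contents into XML using dicttoxml
    xml = dicttoxml.dicttoxml(json_dict)
    # Apply Scrapy's XmlXPathSelector and start using XPaths.
    # text= expects a string, so pass the XML text itself, not a parsed lxml element.
    # (Newer Scrapy releases use Selector(text=..., type="xml") instead.)
    sel = XmlXPathSelector(text=xml.decode('utf-8'))
    # Adjust this XPath to match the XML that dicttoxml generates for your data
    data = sel.select(".//*[@id='count']/text()").extract()
    return data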
I did this because I'm maintaining the XPaths of all my spiders in one place (config files).
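If you want to try the JSON to XML to path-query idea without installing anything, a rough standard-library sketch follows. Note the assumptions: to_xml is a hypothetical hand-written helper, not dicttoxml, and the XML layout it produces differs from dicttoxml's. ElementTree also supports only a limited subset of XPath, so this illustrates the approach rather than replacing the Scrapy selector.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(value, tag="root"):
    """Recursively convert decoded JSON into an ElementTree element.

    Hypothetical helper for illustration only.
    Assumes every dict key is a valid XML tag name.
    """
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(to_xml(child, key))
    elif isinstance(value, list):
        # List entries have no name in JSON, so wrap each one in <item>.
        for child in value:
            elem.append(to_xml(child, "item"))
    else:
        elem.text = "" if value is None else str(value)
    return elem


# Made-up JSON body standing in for an AJAX response.
body = '{"count": 2, "results": [{"name": "a"}, {"name": "b"}]}'
root = to_xml(json.loads(body))

# Path expressions can now live in a config file as plain strings.
count = root.findtext("./count")
names = [e.text for e in root.findall(".//item/name")]
print(count, names)
```

All values come back as text, so convert them (for example with int) where a number is needed.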