简体   繁体   中英

python scrape webpage and parse the content

I want to scrape the data on this link

http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json

I am not sure what type of this link is, is it html or json or something else. Sorry for my bad web knowledge. But I try to use the following code to scrape:

import requests

url='http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json'
source=requests.get(url).text

The type of the source is unicode. I also try to use the urllib2 to scrape like:

source2=urllib2.urlopen(url).read()

The type of source2 is string. I am not sure which method is better. Because the link is not like the normal webpage contains different tags. If I want to clean the scraped data and form the dataframe data (like the pandas dataframe), what method or process I should follow/

Thanks.

The returned response is text containing valid JSON data within it. You can validate it on your own using a service like http://jsonlint.com/ if you want. For doing so just copy the code within the brackets

return_json("JSON code to copy")

In order to make use of that data you just need to parse it in your program. Here an example: https://docs.python.org/2/library/json.html

The response is text. It does contain JSON, just need to extract it

import json

strip_len = len("return_json(")

source=requests.get(url).text[strip_len:-2]
source = json.loads(source) 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM