
Python: extract a JSON structure from an HTML page

In Python I'm reading the content of an HTML page that contains a lot of data. I read the web page into a string like this:

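import urllib.request as req  # assumed import; the snippet uses req.Request and req.urlopen

url = 'https://myurl.com/'
# Send a browser-like User-Agent, since some sites reject the default one
reqq = req.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
reddit_file = req.urlopen(reqq)
reddit_data = reddit_file.read().decode('utf-8')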

If I print reddit_data I can see the whole HTML content correctly. Somewhere inside it there is a JSON-like structure, and I would like to extract some fields from it.

The structure is shown below:

"dealDetails" : {
      "f240141a" : {
         "egressUrl" : "https://ccc.com",
         "title" : "ZZZ",
         "type" : "ghi",
      },
      "5f9ab246" : {
         "egressUrl" : "https://www.bbb.com/",
         "title" : "YYY",
         "type" : "def",
      },
      "2bf6723b" : {
         "egressUrl" : "https://www.aaa.com//",
         "title" : "XXX",
         "type" : "abc",
      },
}

What I want to do is find the dealDetails field and then, for each of f240141a, 5f9ab246 and 2bf6723b, get the egressUrl, title and type values.

Thanks

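Try this:

[nested_dict['egressUrl'] for nested_dict in reddit_data['dealDetails'].values()]

Once the JSON has been parsed, you can treat it as a dictionary and use the usual dictionary syntax to access its values. Note that you need to iterate over .values() here, because .keys() only yields the ID strings.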
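As a minimal sketch of that dictionary access, assuming reddit_data has already been parsed into a dict. The sample values are copied from the question, and the names deal_id, deal and rows are only illustrative:

```python
# Assumption: the data is already a dict; values copied from the question's sample.
reddit_data = {
    "dealDetails": {
        "f240141a": {"egressUrl": "https://ccc.com", "title": "ZZZ", "type": "ghi"},
        "5f9ab246": {"egressUrl": "https://www.bbb.com/", "title": "YYY", "type": "def"},
    }
}

# .items() yields (key, nested dict) pairs, so the ID and its fields are both available.
rows = [
    (deal_id, deal["egressUrl"], deal["title"], deal["type"])
    for deal_id, deal in reddit_data["dealDetails"].items()
]

for row in rows:
    print(row)
```

Using .items() rather than .values() is handy if you also want to keep the ID that each entry belongs to.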

Edit-1:

Make sure reddit_data is a dictionary. If type(reddit_data) is str, you need to convert it first:

import ast
reddit_data = ast.literal_eval(reddit_data)

OR

import json
reddit_data = json.loads(reddit_data)
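Note that both calls expect a string that is entirely a dict/JSON literal, so they would probably fail on the full HTML page, and the sample in the question also has trailing commas, which strict JSON does not allow. A rough sketch of one way around this, assuming the structure appears in the page exactly as shown in the question: cut out the balanced {...} block after "dealDetails", drop the trailing commas, then parse. The helper extract_object and the html sample are hypothetical, not part of any library:

```python
import json
import re


def extract_object(text, key):
    """Return the parsed {...} object that follows "key" in text (hypothetical helper)."""
    key_pos = text.index(f'"{key}"')   # raises ValueError if the key is absent
    start = text.index("{", key_pos)   # first opening brace after the key
    depth = 0
    in_string = False
    escaped = False
    for pos in range(start, len(text)):
        ch = text[pos]
        if in_string:
            # Ignore braces inside string values; honour backslash escapes.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                raw = text[start:pos + 1]
                break
    else:
        raise ValueError("unbalanced braces")
    # Drop trailing commas, which the sample contains but strict JSON forbids.
    # Caveat: this regex would also alter a string value that contains ",}".
    cleaned = re.sub(r",\s*([}\]])", r"\1", raw)
    return json.loads(cleaned)


# Stand-in for the downloaded page; the real reddit_data string would go here.
html = '''<html><script>var x = { "dealDetails" : {
      "f240141a" : {
         "egressUrl" : "https://ccc.com",
         "title" : "ZZZ",
         "type" : "ghi",
      },
} };</script></html>'''

deals = extract_object(html, "dealDetails")
for deal_id, deal in deals.items():
    print(deal_id, deal["egressUrl"], deal["title"], deal["type"])
```

If the page embeds the data differently (for example with single quotes or unquoted keys), this sketch would need adjusting.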
  • If you just want to know how to access egressUrl, title and type, shaik moeed's answer may be all you need. Be careful, however: the following code won't work unless you have first converted the reddit_data HTML string into something like a dictionary (I modified shaik moeed's answer slightly so that it also returns title and type):
[(i['egressUrl'], i['title'], i['type']) for i in reddit_data['dealDetails'].values()]
  • However, if I understood correctly, the part you're missing is the conversion from HTML to a JSON-friendly form. What I personally use, even though it's widely discouraged, is the eval function:
dictionary = eval(reddit_data)

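This evaluates the string as a Python expression and returns a dictionary, so I recommend using it only on the part of the text that 'looks' like a dictionary; the full HTML page is not a valid expression. Be careful with it: eval executes whatever code the string contains, so it is unsafe on pages you don't trust, and it fails with a NameError on JSON literals such as true/false, because they are not Python's True/False :)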
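To illustrate the true/false point, here is a small sketch comparing json.loads with ast.literal_eval on a made-up string (the keys active, expired and note are invented for the example):

```python
import ast
import json

# Made-up JSON text containing the lowercase JSON literals.
text = '{"active": true, "expired": false, "note": null}'

# json.loads maps the JSON literals to Python's True, False and None.
parsed = json.loads(text)
print(parsed)

# ast.literal_eval only accepts Python literals, so the lowercase names are rejected.
try:
    ast.literal_eval(text)
    literal_eval_ok = True
except (ValueError, SyntaxError):
    literal_eval_ok = False
print(literal_eval_ok)
```

So if the extracted block is real JSON, json.loads is usually the safer choice, since it neither executes code nor trips over these literals.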

Hope that helped!

The technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 