
How to extract multiple independently nested JSON objects and keys from a website using Python

I want to extract multiple independent JSON objects and their associated keys from a web page. By "independently nested," I mean that each JSON object is nested within its own <script type="application/ld+json"> element.

I am currently using BeautifulSoup, json, and requests to accomplish this, but I can't get it to work. I have read through similar posts (e.g., here, here, and here), but none of them address this issue: how to extract multiple independently nested JSON objects at once and then pull specific keys from those objects. The other examples assume the JSON objects are all within one nest.

Here is a working example of where I am currently at:

# Using Python 3.8.1, 32 bit, Windows 10
from bs4 import BeautifulSoup
import requests
import json

#%% Create variable with website location
reno = 'https://www.foodpantries.org/ci/nv-reno'

#%% Download the webpage
renoContent = requests.get(reno)

#%% Make into nested html
renoHtml = BeautifulSoup(renoContent.text, 'html.parser')

#%% Keep only the HTML that contains the JSON objects I want
spanList = renoHtml.find("div", class_="span8")

#%% Get JSON objects.
data = json.loads(spanList.find('script', type='application/ld+json').text)
print(data)

This is where I am stuck. I can get the JSON data for the first location, but not for the other 9 locations listed in the spanList variable. How can I have Python get the JSON data from the other 9 locations? I tried spanList.find_all, but that raises AttributeError: ResultSet object has no attribute 'text'. If I remove .text from the json.loads call, I get TypeError: the JSON object must be str, bytes or bytearray, not ResultSet.
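Both errors point at the same thing: find_all returns a list-like ResultSet of tags rather than a single tag, so the text has to be read and parsed one tag at a time. A minimal sketch of that idea follows, assuming a small inline HTML snippet (sample_html, with made-up pantry names) in place of the live page so it runs without a network connection:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two separate ld+json scripts.
sample_html = """
<div class="span8">
  <script type="application/ld+json">{"name": "Pantry A"}</script>
  <script type="application/ld+json">{"name": "Pantry B"}</script>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
container = soup.find("div", class_="span8")

# find_all returns a list-like ResultSet, so loop over it and parse each tag.
scripts = container.find_all("script", type="application/ld+json")
records = [json.loads(tag.string) for tag in scripts]

print(len(records))
print([record["name"] for record in records])
```

Because the loop runs over whatever find_all returns, the same code should cope with 10 locations or 20 without changes.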

My hunch is that this is complicated because each JSON object sits in its own <script type="application/ld+json"> element. None of the other examples I saw had a similar situation. It seems json.loads is only recognizing that first JSON object and then stopping.

The other complication is that the number of locations changes based on the city. I am hoping there is a solution that will automatically pull all the locations no matter how many are on the page (e.g., Reno has 10 but Las Vegas has 20).

I also couldn't figure out how to extract values from the parsed JSON using key names such as name and streetAddress. This could be because of how I am extracting the JSON object via json.loads, but I am unsure.
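For reference, json.loads returns ordinary Python dicts, so keys can be read by name; streetAddress and the other address fields sit one level down, under the address key. A small sketch, assuming a hand-written record (raw) shaped like the page's ld+json blocks:

```python
import json

# Hypothetical record shaped like one ld+json block from the page.
raw = """
{
  "@type": "LocalBusiness",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "2301 Kings Row",
    "addressLocality": "Reno",
    "addressRegion": "NV",
    "postalCode": "89503"
  },
  "name": "Desert Springs Baptist Church"
}
"""

record = json.loads(raw)

# Top-level keys are read directly from the dict.
name = record["name"]

# Address fields are nested, so go through the "address" dict first.
# .get() with a default avoids a KeyError if a listing lacks a field.
address = record.get("address", {})
street = address.get("streetAddress", "")
postal_code = address.get("postalCode", "")

print(name, street, postal_code)
```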

Here is an example of how the JSON object is laid out:

           <script type = "application/ld+json">
            {
            "@context": "https://schema.org",
            "@type": "LocalBusiness",
            "address": {
            "@type":"PostalAddress",
            "streetAddress":"2301 Kings Row",
            "addressLocality":"Reno",
            "addressRegion":"NV",
            "postalCode": "89503"
            },
            "name": "Desert Springs Baptist Church"
            ,"image": 
             "https://www.foodpantries.org/gallery/28591_desert_springs_baptist_church_89503_wzb.jpg"
            ,"description": "Provides a food pantry.  Must provide ID and be willing to fill out intake 
              form Pantry.Hours: Friday 11:00am - 12:00pmFor more information, please call. "
            ,"telephone":"(775) 746-0692"
            }
           </script>

My ultimate goal is to export the data contained within the keys name, streetAddress, addressLocality, addressRegion, and postalCode to a CSV file.
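One way the CSV step could look, once the JSON objects are parsed into dicts, is to flatten each record into a single row and hand the rows to csv.DictWriter. This is only a sketch: the helper names flatten and write_rows, the FIELDS list, and the sample record are assumptions made for illustration.

```python
import csv
import io

# Column order for the CSV; these are the keys named in the question.
FIELDS = ["name", "streetAddress", "addressLocality", "addressRegion", "postalCode"]

def flatten(record):
    """Pull the wanted keys out of one parsed ld+json dict into a flat row."""
    address = record.get("address") or {}
    row = {"name": record.get("name", "")}
    for key in FIELDS[1:]:
        row[key] = address.get(key, "")
    return row

def write_rows(records, handle):
    """Write a header line plus one CSV row per record to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow(flatten(record))

# Hypothetical sample shaped like the parsed page data.
sample = [{
    "name": "Desert Springs Baptist Church",
    "address": {
        "streetAddress": "2301 Kings Row",
        "addressLocality": "Reno",
        "addressRegion": "NV",
        "postalCode": "89503",
    },
}]

buffer = io.StringIO()
write_rows(sample, buffer)
print(buffer.getvalue())
```

To write an actual file, the same write_rows call can take a handle from open(path, "w", newline="", encoding="utf-8"), where path is whatever output filename is wanted.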

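IIUC, you just need to call .find_all on your spanList to get all the JSON objects. It returns a ResultSet (a list of tags), so call json.loads on each tag's .text in turn rather than on the ResultSet itself.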

Try this:

from bs4 import BeautifulSoup
import requests
import json

reno = 'https://www.foodpantries.org/ci/nv-reno'
renoContent = requests.get(reno)
renoHtml = BeautifulSoup(renoContent.text, 'html.parser')

# Collect every ld+json <script> tag inside the span8 div, however many there are
json_scripts = renoHtml.find("div", class_="span8").find_all('script', type='application/ld+json')

# Use strict=False to bypass json.decoder.JSONDecodeError: Invalid control character
data = [json.loads(script.text, strict=False) for script in json_scripts]
print(data)
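The strict=False argument matters because some of the description values on the page contain raw line breaks inside a JSON string, which the default strict parser rejects. A short sketch of the difference, assuming a made-up one-field document (raw):

```python
import json

# Hypothetical JSON text with a literal newline inside a string value.
# The "\n" here is a real line break character, not an escaped one.
raw = '{"description": "Provides a food pantry.\nPlease call."}'

# Default (strict) parsing rejects control characters inside strings.
try:
    json.loads(raw)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# strict=False accepts them and keeps the line break in the value.
relaxed = json.loads(raw, strict=False)

print(strict_ok)
print(repr(relaxed["description"]))
```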

