I am getting a KeyError on a JSON that I scraped

I have scraped a JSON from a website. When I try to iterate through the JSON I get a KeyError, but I'm unsure why. The loop stays within the length of the JSON. Any ideas as to what is going on?

import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0"
}
url = "https://employment.ucsd.edu/jobs?page_size=250&page_number=1&keyword=clinical%20lab%20scientist&location_city" \
      "=Remote&location_city=San%20Diego&location_city=Encinitas&location_city=Murrieta&location_city=La%20Jolla" \
      "&location_city=Not%20Specified&location_city=Vista&sort_by=score&sort_order=DESC "
request = requests.get(url, headers=headers)
response = BeautifulSoup(request.text, "html.parser")
all_data = response.find_all("script", {"type": "application/ld+json"})
df = pd.DataFrame(columns=("Title", "Department", "Salary Range", "Appointment Percent", "URL"))

for data in all_data:
    jsn = json.loads(data.string)
    jsn_length = len(jsn['itemListElement'])
    # print(json.dumps(jsn, indent=4))
    n = 0
    while n < jsn_length:
        # print(jsn['itemListElement'][n])
        df['URL'] = jsn['itemListElement'][n]
        n += 1

Edit: response

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm 2022.1\plugins\python\helpers\pydev\pydevd.py", line 1491, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2022.1\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/Will/PycharmProjects/UCSD_JOB_SCRAPE/main.py", line 19, in <module>
    jsn_length = len(jsn['itemListElement'])
KeyError: 'itemListElement'
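
A quick way to find which block raises this error is to list the "@type" of each parsed block and whether it has the key. The following is only a sketch: raw_blocks stands in for the text of each scraped script tag, the sample strings are made up for illustration, and summarize is a hypothetical helper name.

```python
import json

# Hypothetical stand-ins for the text of each
# <script type="application/ld+json"> tag found on the page.
raw_blocks = [
    '{"@type": "ItemList", "itemListElement": [{"@type": "ListItem", "position": 1}]}',
    '{"@context": "https://schema.org", "@type": "Organization", "name": "UC San Diego "}',
]


def summarize(blocks):
    """Return (index, @type, has_item_list) for every JSON block."""
    summary = []
    for i, raw in enumerate(blocks):
        obj = json.loads(raw)
        # "in" tests for the key without raising KeyError
        summary.append((i, obj.get("@type"), "itemListElement" in obj))
    return summary


for row in summarize(raw_blocks):
    print(row)
```

Any block reported with False in the last position would raise the KeyError above when indexed with ['itemListElement'].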

Element number 250 in the JSON you referenced really doesn't seem to have an itemListElement key:

{
  "@context": "https://schema.org",
  "@type": "Organization",
  "url": "https://health.ucsd.edu/",
  "logo": "https://dy5f5j6i37p1a.cloudfront.net/company/logos/157272/original/b228c5f9007911ecb905ed1c0f90d00e.png",
  "name": "UC San Diego "
}

The safest thing is probably to check for the key explicitly. E.g.:

for data in all_data:
    jsn = json.loads(data.string)
    if jsn.get('itemListElement') is None:
        # Not every ld+json block is an item list; report it and move on
        print('No itemListElement in the JSON. The JSON is\n' + data.string)
    else:
        jsn_length = len(jsn['itemListElement'])
        n = 0
        while n < jsn_length:
            # print(jsn['itemListElement'][n])
            df['URL'] = jsn['itemListElement'][n]
            n += 1
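
The same check can be written more compactly with a default value for .get, collecting the results in a plain list. This is a minimal sketch, not the only way to do it: collect_urls and sample_blocks are hypothetical names, the example.com links are placeholders, and it assumes each list item keeps its link under a "url" key.

```python
import json


def collect_urls(blocks):
    """Gather item URLs from every block that has an itemListElement list."""
    urls = []
    for raw in blocks:
        obj = json.loads(raw)
        # A default of [] turns blocks without the key into a no-op
        for item in obj.get("itemListElement", []):
            url = item.get("url")  # assumed key name
            if url is not None:
                urls.append(url)
    return urls


# Placeholder data standing in for the scraped script contents
sample_blocks = [
    '{"@type": "ItemList", "itemListElement": '
    '[{"url": "http://example.com/job/1"}, {"url": "http://example.com/job/2"}]}',
    '{"@type": "Organization", "name": "UC San Diego "}',
]

urls = collect_urls(sample_blocks)
print(urls)
```

Building the list first and creating the column once avoids assigning df['URL'] inside the loop, where each assignment replaces the whole column.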

To get the list of URLs into a DataFrame, you can use the following example:

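import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0"
}

# Same search URL as in the question
url = (
    "https://employment.ucsd.edu/jobs?page_size=250&page_number=1&keyword=clinical%20lab%20scientist&location_city"
    "=Remote&location_city=San%20Diego&location_city=Encinitas&location_city=Murrieta&location_city=La%20Jolla"
    "&location_city=Not%20Specified&location_city=Vista&sort_by=score&sort_order=DESC "
)

request = requests.get(url, headers=headers)
soup = BeautifulSoup(request.content, "html.parser")

# find() returns only the first ld+json script, which holds the job list
data = json.loads(soup.find("script", {"type": "application/ld+json"}).text)

urls = []
for e in data["itemListElement"]:
    # Assumption: each list item exposes its link under the "url" key
    urls.append(e["url"])

df = pd.DataFrame({"URL": urls})
# Show the first rows with full-width URLs and no header row
print(df.head().to_string(header=False))

Prints:
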
0      http://employment.ucsd.edu/clinical-lab-scientist-specialist-119559/job/20822209
1      http://employment.ucsd.edu/clinical-lab-scientist-specialist-120139/job/21460814
2      http://employment.ucsd.edu/clinical-lab-scientist-specialist-120483/job/21869984
3   http://employment.ucsd.edu/sr-clinical-lab-scientist-specialist-118105/job/20528292
4  http://employment.ucsd.edu/cls-clinical-lab-scientist-specialist-119095/job/20528293
