
Only getting a portion of JSON from website I am trying to scrape using Python, BeautifulSoup, Requests. Getting 20 responses out of 62

I am trying to scrape this site for job openings:

https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/?q=&o=postedDateDesc&w=&wc=&we=&wpst=

I looked in dev tools and saw that the page makes an XHR request to this URL to retrieve the job openings, which come back as a JSON object:

https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults

So I'm like "Great: I can parse this in two seconds using a Python program like this":

from bs4 import BeautifulSoup
import json
import requests

def crawl():
    union = requests.get('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults').content
    soup = BeautifulSoup(union, 'html.parser')
    newDict = json.loads(str(soup))
    for job in newDict['opportunities']:
        print(job['Title'])

crawl()

Well, it turns out that this endpoint only returns 20 job openings out of 62. So I went back to the page and loaded all of it (clicked "view more opportunities").

Dev tools showed that it sent another XHR request to that same link, yet only 20 records are shown when I look at the response.

How can I scrape all of the records from this page? And if someone could explain what is going on behind the scenes, that would be great. I am a little new to web scraping, so any insight is appreciated.

You don't need to scrape the HTML. As you say, the API that returns all the JSON is https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults, but it has to be called with a POST request. A plain GET sends no body, so the server appears to fall back to a default of 20 results. You need to set these parameters in the request body:

import requests

headers = {
    'Content-Type': 'application/json'
}

data = '{\n  "opportunitySearch": {\n    "Top": 62,\n    "Skip": 0,\n    "QueryString": "",\n    "OrderBy": [\n      {\n        "Value": "postedDateDesc",\n        "PropertyName": "PostedDate",\n        "Ascending": false\n      }\n    ],\n    "Filters": [\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 4,\n        "extra": null,\n        "values": [\n          \n        ]\n      },\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 5,\n        "extra": null,\n        "values": [\n          \n        ]\n      },\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 6,\n        "extra": null,\n        "values": [\n          \n        ]\n      }\n    ]\n  },\n  "matchCriteria": {\n    "PreferredJobs": [\n      \n    ],\n    "Educations": [\n      \n    ],\n    "LicenseAndCertifications": [\n      \n    ],\n    "Skills": [\n      \n    ],\n    "hasNoLicenses": false,\n    "SkippedSkills": [\n      \n    ]\n  }\n}'

response = requests.post('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults', headers=headers, data=data)
print(response.text)
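The escaped string above is easier to maintain as a Python dict. A minimal sketch, assuming the endpoint accepts the same JSON when it is passed through the `json=` argument of a requests session (which serializes the dict and sets the Content-Type header itself). `build_payload` and `fetch_openings` are hypothetical helper names for illustration, not part of the site's API:

```python
URL = ('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/'
       '74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults')


def build_payload(top=62, skip=0):
    """Build the same request body as the string above, as a plain dict."""
    return {
        'opportunitySearch': {
            'Top': top,    # how many openings to return
            'Skip': skip,  # how many openings to skip first
            'QueryString': '',
            'OrderBy': [{'Value': 'postedDateDesc',
                         'PropertyName': 'PostedDate',
                         'Ascending': False}],
            # Three empty term filters, exactly as in the original body.
            'Filters': [{'t': 'TermsSearchFilterDto', 'fieldName': field,
                         'extra': None, 'values': []}
                        for field in (4, 5, 6)],
        },
        'matchCriteria': {
            'PreferredJobs': [], 'Educations': [],
            'LicenseAndCertifications': [], 'Skills': [],
            'hasNoLicenses': False, 'SkippedSkills': [],
        },
    }


def fetch_openings(session, top=62, skip=0):
    """POST the payload with a requests.Session-like object.

    Returns the list stored under 'opportunities' in the response.
    """
    response = session.post(URL, json=build_payload(top, skip), timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()['opportunities']


# Usage (performs a real network request):
#   import requests
#   for job in fetch_openings(requests.Session()):
#       print(job['Title'])
```

Passing the session in as an argument keeps the payload logic separate from the network call, so the body can be inspected or tested without hitting the site.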

And here is the same request using pandas (pip install pandas):

import requests
import pandas as pd
pd.set_option('display.width', 1000)

headers = {
    'Content-Type': 'application/json'
}

data = '{\n  "opportunitySearch": {\n    "Top": 62,\n    "Skip": 0,\n    "QueryString": "",\n    "OrderBy": [\n      {\n        "Value": "postedDateDesc",\n        "PropertyName": "PostedDate",\n        "Ascending": false\n      }\n    ],\n    "Filters": [\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 4,\n        "extra": null,\n        "values": [\n          \n        ]\n      },\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 5,\n        "extra": null,\n        "values": [\n          \n        ]\n      },\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 6,\n        "extra": null,\n        "values": [\n          \n        ]\n      }\n    ]\n  },\n  "matchCriteria": {\n    "PreferredJobs": [\n      \n    ],\n    "Educations": [\n      \n    ],\n    "LicenseAndCertifications": [\n      \n    ],\n    "Skills": [\n      \n    ],\n    "hasNoLicenses": false,\n    "SkippedSkills": [\n      \n    ]\n  }\n}'

response = requests.post('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults', headers=headers, data=data)
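# Parse the JSON response and load the list of openings into a DataFrame
result = response.json()
df = pd.DataFrame(result['opportunities'])
# Keep only the columns of interest
df = df[['Id', 'Title', 'RequisitionNumber', 'JobCategoryName', 'PostedDate']]
print(df.head(5))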
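If pandas is not available, the same column selection can be done with the standard library. A minimal sketch, assuming each item under 'opportunities' is a dict containing the keys used above. `to_csv_text` is a hypothetical helper, and the sample record is made up for illustration:

```python
import csv
import io

FIELDS = ['Id', 'Title', 'RequisitionNumber', 'JobCategoryName', 'PostedDate']


def to_csv_text(opportunities, fields=FIELDS):
    """Render the selected fields of each opening as CSV text."""
    buffer = io.StringIO()
    # extrasaction='ignore' drops keys that are not listed in fields.
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    for job in opportunities:
        writer.writerow(job)  # missing keys are written as empty cells
    return buffer.getvalue()


# Made-up record standing in for the list under 'opportunities'.
sample = [{'Id': 'abc', 'Title': 'Analyst', 'RequisitionNumber': 'REQ-1',
           'JobCategoryName': 'Finance', 'PostedDate': '2020-01-01',
           'Extra': 'ignored'}]
print(to_csv_text(sample))
```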

In data, "Top": 62 sets the maximum number of results returned, so it acts as the limit on your results. Here is the same body pretty-printed:

{
  "opportunitySearch": {
    "Top": 62,
    "Skip": 0,
    "QueryString": "",
    "OrderBy": [
      {
        "Value": "postedDateDesc",
        "PropertyName": "PostedDate",
        "Ascending": false
      }
    ],
    "Filters": [
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 4,
        "extra": null,
        "values": [

        ]
      },
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 5,
        "extra": null,
        "values": [

        ]
      },
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 6,
        "extra": null,
        "values": [

        ]
      }
    ]
  },
  "matchCriteria": {
    "PreferredJobs": [

    ],
    "Educations": [

    ],
    "LicenseAndCertifications": [

    ],
    "Skills": [

    ],
    "hasNoLicenses": false,
    "SkippedSkills": [

    ]
  }
}
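The "view more opportunities" button most likely sends this same body with a different Skip value, which would explain why each XHR response in dev tools shows only 20 records. When the total is not known in advance, the results can be paged instead of hard-coding 62. A minimal sketch, assuming Skip is a zero-based offset and Top is the page size. `iter_openings` and `fetch_page` are hypothetical names, and the demonstration uses fake records rather than a network call:

```python
def iter_openings(fetch_page, page_size=20, max_pages=100):
    """Yield every opening by advancing Skip until a short page comes back.

    fetch_page(top, skip) must return the list under 'opportunities'
    for that window. max_pages guards against looping forever if the
    server were to ignore Skip.
    """
    for page_number in range(max_pages):
        page = fetch_page(page_size, page_number * page_size)
        yield from page
        if len(page) < page_size:
            return  # last (possibly empty) page reached


# Offline demonstration: 62 fake records instead of a real request.
fake_records = [{'Title': f'Job {n}'} for n in range(62)]


def fake_fetch(top, skip):
    return fake_records[skip:skip + top]


jobs = list(iter_openings(fake_fetch))
print(len(jobs))  # 62
```

A real `fetch_page` would POST the body shown above with Top and Skip filled in and return the 'opportunities' list from the response.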


