Only getting part of the JSON from a website I am trying to scrape with Python, BeautifulSoup, and requests: 20 records out of 62

Question

I am trying to scrape this site for job openings:

https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/?q=&o=postedDateDesc&w=&wc=&we=&wpst=

I looked in dev tools and saw that the page makes an XHR request to this URL to retrieve the job openings, which come back as a JSON object:

https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults

So I thought, "Great, I can parse this in two seconds with a Python program like this":

from bs4 import BeautifulSoup
import json
import requests

def crawl():
    union = requests.get('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults').content
    soup = BeautifulSoup(union, 'html.parser')
    newDict = json.loads(str(soup))
    for job in newDict['opportunities']:
        print(job['Title'])

crawl()

Well, it turns out that this page only returns 20 job openings out of 62. So I went back to the page and loaded all of it (clicked "view more opportunities").

Dev tools showed that it sent another XHR request to that same link, yet only 20 records are shown when I look.

How can I scrape all of the records from this page? And if someone could explain what is going on behind the scenes, that would be great. I am a little new to web scraping, so any insight is appreciated.

Answer 1

You don't need to scrape anything. As you say, the API that returns all of the JSON is https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults, but you need to send these parameters in the request body:

import requests

headers = {
    'Content-Type': 'application/json'
}

data = '{\n  "opportunitySearch": {\n    "Top": 62,\n    "Skip": 0,\n    "QueryString": "",\n    "OrderBy": [\n      {\n        "Value": "postedDateDesc",\n        "PropertyName": "PostedDate",\n        "Ascending": false\n      }\n    ],\n    "Filters": [\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 4,\n        "extra": null,\n        "values": [\n          \n        ]\n      },\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 5,\n        "extra": null,\n        "values": [\n          \n        ]\n      },\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 6,\n        "extra": null,\n        "values": [\n          \n        ]\n      }\n    ]\n  },\n  "matchCriteria": {\n    "PreferredJobs": [\n      \n    ],\n    "Educations": [\n      \n    ],\n    "LicenseAndCertifications": [\n      \n    ],\n    "Skills": [\n      \n    ],\n    "hasNoLicenses": false,\n    "SkippedSkills": [\n      \n    ]\n  }\n}'

response = requests.post('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults', headers=headers, data=data)
print(response.text)
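
The hand-escaped JSON string above is easy to break when editing. As a minimal sketch, the same body can be built from a Python dict and passed to `requests.post` through its `json=` argument, which serializes it and sets the `Content-Type: application/json` header. The names `build_search_body` and `fetch_opportunities` are hypothetical helpers, not part of the site's API, and the sketch assumes the endpoint accepts exactly the fields shown above.

```python
URL = ("https://recruiting.ultipro.com/UNI1029UNION/JobBoard/"
       "74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults")


def build_search_body(top=62, skip=0):
    """Return the request body from the answer as a dict (hypothetical helper)."""
    # Three empty term filters, matching fieldName 4, 5 and 6 in the original body.
    filters = [
        {"t": "TermsSearchFilterDto", "fieldName": n, "extra": None, "values": []}
        for n in (4, 5, 6)
    ]
    return {
        "opportunitySearch": {
            "Top": top,
            "Skip": skip,
            "QueryString": "",
            "OrderBy": [{
                "Value": "postedDateDesc",
                "PropertyName": "PostedDate",
                "Ascending": False,
            }],
            "Filters": filters,
        },
        "matchCriteria": {
            "PreferredJobs": [],
            "Educations": [],
            "LicenseAndCertifications": [],
            "Skills": [],
            "hasNoLicenses": False,
            "SkippedSkills": [],
        },
    }


def fetch_opportunities(top=62, skip=0):
    """POST the body and return the 'opportunities' list (needs network access)."""
    import requests  # third-party: pip install requests

    response = requests.post(URL, json=build_search_body(top, skip))
    response.raise_for_status()
    return response.json()["opportunities"]
```

Calling `fetch_opportunities()` and printing each item's `Title` would then replace the BeautifulSoup step from the question, since the response is already JSON.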

And here it is using pandas (pip install pandas):

import requests
import pandas as pd
pd.set_option('display.width', 1000)

headers = {
    'Content-Type': 'application/json'
}

data = '{\n  "opportunitySearch": {\n    "Top": 62,\n    "Skip": 0,\n    "QueryString": "",\n    "OrderBy": [\n      {\n        "Value": "postedDateDesc",\n        "PropertyName": "PostedDate",\n        "Ascending": false\n      }\n    ],\n    "Filters": [\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 4,\n        "extra": null,\n        "values": [\n          \n        ]\n      },\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 5,\n        "extra": null,\n        "values": [\n          \n        ]\n      },\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 6,\n        "extra": null,\n        "values": [\n          \n        ]\n      }\n    ]\n  },\n  "matchCriteria": {\n    "PreferredJobs": [\n      \n    ],\n    "Educations": [\n      \n    ],\n    "LicenseAndCertifications": [\n      \n    ],\n    "Skills": [\n      \n    ],\n    "hasNoLicenses": false,\n    "SkippedSkills": [\n      \n    ]\n  }\n}'

response = requests.post('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults', headers=headers, data=data)
data=response.json()
df=pd.DataFrame.from_dict(data['opportunities'])
df= df[['Id','Title','RequisitionNumber','JobCategoryName','PostedDate']]
print(df.head(5))

In the request body, "Top": 62 sets the limit on how many results are returned:

{
  "opportunitySearch": {
    "Top": 62,
    "Skip": 0,
    "QueryString": "",
    "OrderBy": [
      {
        "Value": "postedDateDesc",
        "PropertyName": "PostedDate",
        "Ascending": false
      }
    ],
    "Filters": [
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 4,
        "extra": null,
        "values": [

        ]
      },
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 5,
        "extra": null,
        "values": [

        ]
      },
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 6,
        "extra": null,
        "values": [

        ]
      }
    ]
  },
  "matchCriteria": {
    "PreferredJobs": [

    ],
    "Educations": [

    ],
    "LicenseAndCertifications": [

    ],
    "Skills": [

    ],
    "hasNoLicenses": false,
    "SkippedSkills": [

    ]
  }
}
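
Hard-coding `"Top": 62` only works while the board has at most 62 openings. A hedged sketch of the alternative is to request fixed-size pages and advance `Skip` until a short page comes back. This assumes `Skip` is a record offset and `Top` a page size, which is their usual meaning but is not confirmed here; `fetch_all` and `fetch_page` are hypothetical names, and the demonstration runs against a fake in-memory list rather than the real endpoint.

```python
def fetch_all(fetch_page, page_size=20):
    """Collect every record by advancing the offset until a short page appears.

    fetch_page(top, skip) is a hypothetical callable that returns a list of
    records for one page.
    """
    records = []
    skip = 0
    while True:
        page = fetch_page(page_size, skip)
        records.extend(page)
        if len(page) < page_size:  # short or empty page: nothing left to fetch
            break
        skip += page_size
    return records


# Offline demonstration: a fake source of 62 records stands in for the API.
fake_jobs = [{"Title": f"Job {i}"} for i in range(62)]


def fake_fetch_page(top, skip):
    return fake_jobs[skip:skip + top]


all_jobs = fetch_all(fake_fetch_page)
print(len(all_jobs))  # 62
```

Against the real site, `fetch_page` would POST the body shown above with `Top` and `Skip` replaced and return the `opportunities` list from the response.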

Solution 1
0 2019-11-03 03:02:02
