Scraping site missing data

So I'm trying to scrape the open positions on this site, and when I use any kind of request (currently trying requests-html) it doesn't show everything that's in the HTML.

# Import libraries
import time
from bs4 import BeautifulSoup
from requests_html import HTMLSession

# Set the URL you want to webscrape from
url = 'https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican'

session = HTMLSession()

# Connect to the URL
response = session.get(url)

response.html.render()

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html5lib")  

b = soup.findAll('a')

Not sure where to go from here. Originally I thought the problem was due to javascript rendering, but calling render() hasn't fixed it.

I don't think it's possible to scrape that website with Requests. I would suggest using Selenium or Scrapy.

The issue is that the initial GET doesn't return the data (which I assume is the job listings); the js that does fetch it uses a POST with an authorization token in the header. You need to get this token first and then make the POST to get the data.

This token appears to be dynamic, so getting it is going to be a little wonky, but doable.

import json

from bs4 import BeautifulSoup as bs
from requests_html import HTMLSession

url0 = r'https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican'
url = r'https://germanamerican.csod.com/services/x/career-site/v1/search'

s = HTMLSession()
r = s.get(url0)
print(r.status_code)
r.html.render()

soup = bs(r.text, 'html.parser')

scripts = soup.find_all('script')

# Find the inline script that embeds the page context (it contains the token)
for script in scripts:
    if 'csod.context=' in script.text:
        x = script

j = json.loads(x.text.replace('csod.context=', '').replace(';', ''))
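# Note: the replace(';', '') above would corrupt the JSON if the context
# object itself contained semicolons; a slightly safer variant (same
# assumption that the script body is exactly 'csod.context=<json>;') is:
# j = json.loads(x.text.split('csod.context=', 1)[1].strip().rstrip(';'))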


payload={
    'careerSiteId': 5,
    'cities': [],
    'countryCodes': [],
    'cultureId': 1,
    'cultureName': "en-US",
    'customFieldCheckboxKeys': [],
    'customFieldDropdowns': [],
    'customFieldRadios': [],
    'pageNumber': 1,
    'pageSize': 25,
    'placeID': "",
    'postingsWithinDays': None,
    'radius': None,
    'searchText': "",
    'states': []
}

headers={
    'accept': 'application/json; q=1.0, text/*; q=0.8, */*; q=0.1',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'authorization': 'Bearer '+j['token'],
    'cache-control': 'no-cache',
    # content-length is omitted; requests computes it from the json payload
    'content-type': 'application/json',
    'csod-accept-language': 'en-US',
    'origin': 'https://germanamerican.csod.com',
    'referer': 'https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest'
}

r = s.post(url, headers=headers, json=payload)
print(r.status_code)
print(r.json())

The r.json() that's printed out is a nicely formatted JSON version of the table of job listings.
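To consume it, walk the parsed result like any JSON payload. A minimal sketch, assuming the listings sit under keys like data / requisitions with a title field (these names are a guess; inspect r.json() yourself and adjust):

data = r.json()

# NOTE: the key names here are assumptions -- print data.keys() first and adapt
for job in data.get('data', {}).get('requisitions', []):
    print(job.get('title'))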

Welcome to SO!

Unfortunately, you won't be able to scrape that page with requests (nor requests_html or similar libraries), because you need a tool that can handle dynamic pages, i.e. javascript-based ones.

With python, I would strongly suggest selenium and its webdriver. Below is a piece of code that prints the desired output, i.e. all listed jobs (NB: it requires selenium and the Firefox webdriver to be installed and on the correct PATH to run).

# Import libraries
from bs4 import BeautifulSoup
from selenium import webdriver

# Set the URL you want to webscrape from
url = 'https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican'

browser = webdriver.Firefox() # initialize the webdriver; I use Firefox, but Chromium etc. work too

browser.get(url) # go to the desired page; you might want to wait a bit on a slow connection
page = browser.page_source # the page source, now complete with the listings loaded by javascript
soup = BeautifulSoup(page, "lxml")
jobs = soup.findAll('a', {'data-tag' : 'displayJobTitle'})
for j in jobs:
    print(j.text)

browser.quit()
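On a slow connection, reading page_source immediately can race the javascript. A hedged refinement using selenium's explicit waits: block until at least one job-title link (the same data-tag attribute queried above) has rendered before grabbing the source.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
browser.get(url)

# Wait up to 15 seconds for a job-title link to appear before reading the source
WebDriverWait(browser, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'a[data-tag="displayJobTitle"]'))
)
page = browser.page_source
browser.quit()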
