
Web scraping of dynamic content with Beautiful Soup

To train my Python skills I tried to scrape the number of open jobs for a specific job from the website of the German "Arbeitsagentur" ( https://www.arbeitsagentur.de/jobsuche/ ). I used the web-developer inspection tool of the Firefox browser to locate the element containing the information, e.g. "12.231 Jobs für Informatiker/in". My code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait

content = "https://www.arbeitsagentur.de/jobsuche/suche?angebotsart=1&was=Informatiker%2Fin"
options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(executable_path="C:/Drivers/geckodriver/geckodriver.exe", options=options)
driver.get(content)
soup = BeautifulSoup(driver.page_source, 'html.parser')
num_jobs = soup.select_one('div[class="h1-zeile-suche-speichern-container-content container-fluid"] h2')
print(num_jobs)
driver.close()

As a result I get the correct element, but it does not contain the queried information. Instead, I get this output:

<h2 _ngcontent-serverapp-c39="" class="h6" id="suchergebnis-h1-anzeige">Jobs for Informatiker/in are loaded</h2>

In the Firefox web inspector I see instead:

<h2 id="suchergebnis-h1-anzeige" class="h6" _ngcontent-serverapp-c39="">
12.231 Jobs für Informatiker/in</h2>

I tried the WebDriverWait method and driver.implicitly_wait() to wait until the page has loaded completely, but without success. Presumably this value is computed and inserted by a JavaScript script. As I am not a web developer I don't know how this works or how to correctly extract the line with the number of jobs. I tried to use the debugger in the Firefox developer tools to see where and how the value is calculated, but most of the scripts are just very cryptic one-liners.

(Extracting the number out of the string by means of a regular expression will be no problem at all.)
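For completeness, that extraction could look like this (a minimal sketch; the sample string is taken from the inspector output above):

```python
import re

# Sample heading text as shown in the Firefox inspector
text = "12.231 Jobs für Informatiker/in"

# German-style numbers use "." as the thousands separator
match = re.search(r"([\d.]+)\s+Jobs", text)
num_jobs = int(match.group(1).replace(".", ""))
print(num_jobs)  # -> 12231
```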

I would really appreciate your support or any useful hint.

Since the contents are loaded dynamically, you can only parse the number of job results after a certain element becomes visible; at that point all elements have been loaded and you can successfully parse the desired data.

You could also increase the sleep time to let all data load, but that is a bad solution.
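A fixed sleep either wastes time or is still too short. An explicit wait instead polls a condition until it holds, which is essentially what WebDriverWait does internally. A minimal sketch of that idea (the helper name `wait_until` is made up for illustration):

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.1):
    """Repeatedly evaluate predicate until it returns a truthy value
    or the timeout expires; this is the core idea behind WebDriverWait."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(interval)

# Usage sketch: a stand-in for a real check such as "element is visible".
# The wait returns as soon as the condition holds, with no fixed sleep.
state = {"polls": 0}

def page_loaded():
    state["polls"] += 1
    return state["polls"] >= 3

print(wait_until(page_loaded))  # -> True, after a few polls
```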

Working code:

import time

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()

# options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1920x1080")
options.add_argument("--disable-extensions")

chrome_driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options
)


def arbeitsagentur_scraper():
    URL = "https://www.arbeitsagentur.de/jobsuche/suche?angebotsart=1&was=Informatiker%2Fin"
    with chrome_driver as driver:
        driver.implicitly_wait(15)
        driver.get(URL)
        wait = WebDriverWait(driver, 10)
        
        # time.sleep(10) # increase the load time to fetch all element, not advised solution
       
        # wait until this element is visible 
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.liste-container')))
        
        elem = driver.find_element(By.XPATH,
                                   '/html/body/jb-root/main/jb-jobsuche/jb-jobsuche-suche/div[1]/div/jb-h1zeile/h2')
        print(elem.text)


arbeitsagentur_scraper()

Output:

12.165 Jobs für Informatiker/in

Alternatively, you can use their API URL to load the results. For example:

import json
import requests


api_url = "https://rest.arbeitsagentur.de/jobboerse/jobsuche-service/pc/v4/jobs"

query = {
    "angebotsart": "1",
    "was": "Informatiker/in",
    "page": "1",
    "size": "25",
    "pav": "false",
}

headers = {
    # NOTE: this OAuth access token is time-limited (it carries an "exp"
    # claim) and will need to be replaced once it expires.
    "OAuthAccessToken": "eyJhbGciOiJIUzUxMiJ9.eyAic3ViIjogIklkNFZSNmJoZFpKSjgwQ2VsbHk4MHI4YWpkMD0iLCAiaXNzIjogIk9BRyIsICJpYXQiOiAxNjU0MDM2ODQ1LCAiZXhwIjogMS42NTQwNDA0NDVFOSwgImF1ZCI6IFsgIk9BRyIgXSwgIm9hdXRoLnNjb3BlcyI6ICJhcG9rX21ldGFzdWdnZXN0LCBqb2Jib2Vyc2Vfc3VnZ2VzdC1zZXJ2aWNlLCBhYXMsIGpvYmJvZXJzZV9rYXRhbG9nZS1zZXJ2aWNlLCBqb2Jib2Vyc2Vfam9ic3VjaGUtc2VydmljZSwgaGVhZGVyZm9vdGVyX2hmLCBhcG9rX2hmLCBqb2Jib2Vyc2VfcHJvZmlsLXNlcnZpY2UiLCAib2F1dGguY2xpZW50X2lkIjogImRjZGVhY2JkLTJiNjItNDI2MS1hMWZhLWQ3MjAyYjU3OTg0OCIgfQ.BBkJbJ93fGqQQQGX4-VTzX8P6Twg8Rthq8meXV2WY_CoUmXQWhdgbjkFozP2BJXooSr7yLaTJr7JXEk8hDnCWA",
}

data = requests.get(api_url, params=query, headers=headers).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

print(data["maxErgebnisse"])

Prints:

12165
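Note that the API returns the count as a plain integer (12165), while the page displays it with the German thousands separator ("12.165"). If you want to reproduce the on-page formatting, a small sketch:

```python
def format_german(n: int) -> str:
    """Format an integer with '.' as the thousands separator, as on the page."""
    return f"{n:,}".replace(",", ".")

print(format_german(12165))  # -> 12.165
```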
