
Scraping multiple posts from a web page

I am trying to scrape every job on the page, but without success. I have tried different approaches, but none of them worked. After opening and scraping the first job, the script crashes. I don't know what I should do to move on to the next job. Could anyone help me get it working? Thanks in advance. I had to shorten the code because the site wouldn't let me post all of it (too much code).

# Part 1
from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

df = pd.DataFrame(columns=["Title", "Description", 'Job-type', 'Skills'])

for i in range(25):
    driver.get('https://www.reed.co.uk/jobs/care-jobs?pageno=' + str(i))
    jobs = []
    driver.implicitly_wait(20)

    for job in driver.find_elements_by_xpath('//*[@id="content"]/div[1]/div[3]'):
        soup = BeautifulSoup(job.get_attribute('innerHTML'), 'html.parser')
        element = WebDriverWait(driver, 50).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler")))
        element.click()

        try:
            title = soup.find("h3", class_="title").text.replace("\n", "").strip()
            print(title)
        except:
            title = 'None'

        sum_div = job.find_element_by_css_selector('#jobSection42826858 > div.row > div > header > h3 > a')
        sum_div.click()
        driver.implicitly_wait(2)

        try:
            job_desc = driver.find_element_by_css_selector('#content > div > div.col-xs-12.col-sm-12.col-md-12 > article > div > div.branded-job-details--container > div.branded-job--content > div.branded-job--description-container > div').text
            #print(job_desc)
        except:
            job_desc = 'None'

        try:
            job_type = driver.find_element_by_xpath('//*[@id="content"]/div/div[2]/article/div/div[2]/div[3]/div[2]/div/div/div[3]/div[3]/span').text
            #print(job_type)
        except:
            job_type = 'None'

        try:
            job_skills = driver.find_element_by_xpath('//*[@id="content"]/div/div[2]/article/div/div[2]/div[3]/div[6]/div[2]/ul').text
            #print(job_skills)
        except:
            job_skills = 'None'

        driver.back()
        driver.implicitly_wait(2)
        df = df.append({'Title': title, "Description": job_desc, 'Job-type': job_type, 'Skills': job_skills}, ignore_index=True)

df.to_csv(r"C:\Users\Desktop\Python\newreed.csv", index=False)

In my experience, managing Chrome with selenium is trickier than Firefox or Edge. If you don't specifically need Chrome, I would try the Firefox or Edge driver instead. I've had good luck with Edge when Chrome was giving me problems.
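For example, a minimal sketch of swapping in Firefox or Edge, keeping the Selenium 3 style used in the question (the webdriver_manager class names are assumptions based on common versions of that package):

from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
# from webdriver_manager.microsoft import EdgeChromiumDriverManager

# Firefox via geckodriver, downloaded automatically by webdriver_manager
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())

# Edge alternative (uncomment the import above):
# driver = webdriver.Edge(EdgeChromiumDriverManager().install())

driver.get("https://www.reed.co.uk/jobs/care-jobs?pageno=1")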

You should avoid Selenium (it was not originally designed for web scraping). Instead, look at the F12 -> Network -> html or xhr tabs: here the job listings are already present in the initial HTML, so plain requests plus BeautifulSoup is enough.

Here is my code:

import requests as rq
from bs4 import BeautifulSoup as bs

def processPageData(soup):
    articles = soup.find_all("article")
    resultats = {}
    for article in articles:
        # article ids look like "jobSection42826858"; strip the
        # 10-character "jobSection" prefix to keep the bare job id
        job_id = article["id"][10:]
        resultats[job_id] = {}

        metadata = article.find("div", class_="metadata")
        location = metadata.find("li", class_="location").text.strip().split('\n')
        resultats[job_id]['location'] = list(map(str.strip, location))
        resultats[job_id]['salary'] = metadata.find("li", class_="salary").text

        resultats[job_id]['description'] = article.find("div", class_="description").find("p").text
        resultats[job_id]['posted_by'] = article.find("div", class_="posted-by").text.strip()

    return resultats

Then iterate over the listing pages, calling the function above on each:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
           "Host": "www.reed.co.uk"}

resultats = {}
s = rq.session()

for i in range(1, 10):
    url = "https://www.reed.co.uk/jobs/care-jobs?pageno=%d" % i
    resp = s.get(url, headers=headers)
    soup = bs(resp.text, "lxml")
    resultats.update(processPageData(soup))

Which gives:

{'42826858': {'location': ['Horsham', 'West Sussex'],
  'salary': '£11.50 - £14.20 per hour',
  'description': 'Come and join the team as a Care Assistant and make the Alina Homecare difference. We are looking for kind and caring people who want to make a difference to the lives of others. If you have a caring attitude and willingness to make a difference, come...',
  'posted_by': 'Posted Today by Alina Homecare'},

 '42827040': {'location': ['Redhill', 'Surrey'],
  'salary': '£11.00 - £13.00 per hour',
  'description': 'Come and join the team as a Care Assistant and make the Alina Homecare difference. We are looking for kind and caring people who want to make a difference to the lives of others. If you have a caring attitude and willingness to make a difference, come...',
  'posted_by': 'Posted Today by Alina Homecare'},

....
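Since the original goal was a CSV, the nested dictionary converts directly (a small sketch assuming pandas is installed; the output filename is arbitrary):

import pandas as pd

# keys of resultats become the index, inner dicts become columns
df = pd.DataFrame.from_dict(resultats, orient="index")
df.index.name = "job_id"
df.to_csv("reed_care_jobs.csv")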

Note 1: the keys of resultats are the job identifiers; you can use them to navigate to a job's own page if you need more details.
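For instance, a sketch of following an article through to its detail page; it assumes each article's <h3 class="title"> wraps a relative link, which matches the selectors visible in the question's code:

def jobUrl(article):
    # the title heading of each listing wraps the link to the job page
    link = article.find("h3", class_="title").find("a")
    return "https://www.reed.co.uk" + link["href"]

# usage: fetch the detail page of the first article on a listing page
article = soup.find("article")
detail_soup = bs(s.get(jobUrl(article), headers=headers).text, "lxml")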

Note 2: I iterate over pages 1 to 9 here (range(1, 10)); you could adapt the code to discover the maximum number of pages instead.
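One way to avoid hard-coding the page count is to keep requesting pages until one comes back with no job articles; this sketch assumes reed.co.uk serves an article-free page past the last result, which I have not verified:

resultats = {}
i = 1
while True:
    resp = s.get("https://www.reed.co.uk/jobs/care-jobs?pageno=%d" % i, headers=headers)
    page = processPageData(bs(resp.text, "lxml"))
    if not page:  # no <article> elements left: we ran past the last page
        break
    resultats.update(page)
    i += 1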

Note 3: as general advice, invest in understanding the site's data model rather than proceeding by trial and error, let alone reaching for selenium the wrong way.

Note 4: CSS and XPath selectors copied from the browser are ugly; prefer cleaner selection by tag. (Personal opinion.)
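To illustrate with selectors from this very page, compare the XPath from the question with the tag-based lookup used above; the two lines target different fields and are only meant to contrast the styles:

# brittle: breaks as soon as the layout shifts by one <div>
driver.find_element_by_xpath('//*[@id="content"]/div/div[2]/article/div/div[2]/div[3]/div[2]/div/div/div[3]/div[3]/span')

# anchored to meaning rather than layout
article.find("li", class_="salary").text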

