Scraping HTML with BeautifulSoup

Question

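I want to scrape the following website, which is a repository of cases: https://engagements.ceres.org/?_ga=2.157917299.852607976.1552678391-697747477.1552678391

The features I intend to extract are:

"Organization", "Industry", "Title", "Filed_By", "Status", "Year", and "Summary" (the body text)

My question is how to scrape each case and make the program loop through all the pages.

The URL in my code is only the first case, but I need to loop through all the pages in the repository (88 pages) and write the results to a CSV.

I would like to know whether using lambda is feasible in this situation.

Could someone also explain how to understand and identify patterns in HTML tags for future use? I am new to this field.

The following code is what I have so far: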

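import requests
from bs4 import BeautifulSoup

url = "https://engagements.ceres.org/ceres_engagementdetailpage?recID=a0l1H00000CDy78QAD"

# verify=False skips TLS certificate verification for this request
page = requests.get(url, verify=False)

soup = BeautifulSoup(page.text, 'html.parser')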

Answer 1

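I think you need to combine bs with selenium, because some of the content loads a little slowly. You can use bs to get the initial links, then use selenium with a wait to make sure the content on each page has loaded. You will need to deal with the certificate problem first.

I wasn't sure what the summary is, so I have returned all the p tags. This means some of the information is duplicated. You can refine this.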
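
On the question of recognising patterns in HTML tags: the answer's code relies on the fact that every case link contains the same substring in its href. Below is a minimal sketch of that idea on a made-up HTML fragment. SAMPLE_HTML and the helper name detail_links are assumptions for illustration, not the site's real markup.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical listing-page fragment, invented for this example.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/ceres_engagementdetailpage?recID=AAA">Case A</a></td></tr>
  <tr><td><a href="/ceres_engagementdetailpage?recID=BBB">Case B</a></td></tr>
  <tr><td><a href="/ceres_engagementdetailpage?recID=AAA">Case A again</a></td></tr>
  <tr><td><a href="/about">About</a></td></tr>
</table>
"""


def detail_links(html, base_url):
    """Return absolute URLs of case detail pages, in order, without repeats."""
    soup = BeautifulSoup(html, "html.parser")
    # [href*='...'] matches any element whose href CONTAINS the given substring.
    anchors = soup.select("a[href*='ceres_engagementdetailpage?recID=']")
    links = []
    for a in anchors:
        url = urljoin(base_url, a["href"])  # resolve relative hrefs
        if url not in links:
            links.append(url)
    return links


links = detail_links(SAMPLE_HTML, "https://engagements.ceres.org")
print(links)
```

The same helper could be applied to the HTML of each listing page in turn, once you have worked out in the browser's developer tools how the site moves from one page to the next.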

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

baseUrl = 'https://engagements.ceres.org'
results = []
driver = webdriver.Chrome()

# Fetch the listing page with requests and collect the link to each case detail page.
r = requests.get('https://engagements.ceres.org/?_ga=2.157917299.852607976.1552678391-697747477.1552678391', verify=False)
soup = bs(r.content, 'lxml')
items = [baseUrl + item['href'] for item in soup.select("[href*='ceres_engagementdetailpage?recID=']")]

for item in items:
    driver.get(item)
    # Wait (up to 10 seconds) until at least one <p> is present, i.e. the content has loaded.
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "p")))
    title = driver.find_element(By.CSS_SELECTOR, '.resolutionsTitle').text
    organization = driver.find_element(By.CSS_SELECTOR, '#description p').text
    year = driver.find_element(By.CSS_SELECTOR, '#description p + p').text
    aList = driver.find_elements(By.CSS_SELECTOR, '.td2')
    industry = aList[0].text
    filedBy = aList[2].text
    status = aList[5].text
    # Every <p> in the description; this repeats some of the fields above.
    summary = [p.text for p in driver.find_elements(By.CSS_SELECTOR, '#description p')]
    results.append([organization, industry, title, filedBy, status, year, summary])

driver.quit()
df = pd.DataFrame(results, columns=['Organization', 'Industry', 'Title', 'Filed By', 'Status', 'Year', 'Summary'])
print(df)
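
The question also asks for CSV output, and the summary as collected repeats the organization and year. Here is a rough sketch of one way to tidy the summary and write the rows out with the standard csv module. The helper names tidy_summary and write_rows and the sample record are made up for illustration; the real pages may need different cleaning.

```python
import csv
import io

COLUMNS = ["Organization", "Industry", "Title", "Filed By", "Status", "Year", "Summary"]


def tidy_summary(paragraphs, already_captured):
    """Drop empty, repeated, and already-captured paragraphs, then join the rest."""
    seen = set(already_captured)
    kept = []
    for text in paragraphs:
        text = text.strip()
        if text and text not in seen:
            seen.add(text)
            kept.append(text)
    return "\n".join(kept)


def write_rows(rows, handle):
    """Write a header line followed by one CSV line per scraped case."""
    writer = csv.writer(handle)
    writer.writerow(COLUMNS)
    writer.writerows(rows)


# Made-up record standing in for one scraped case.
org, year = "Example Corp", "2019"
paragraphs = ["Example Corp", "2019", "Report on emissions.", "Report on emissions.", ""]
row = [org, "Energy", "Emissions report", "Example Filer", "Filed", year,
       tidy_summary(paragraphs, [org, year])]

buffer = io.StringIO()  # in-memory stand-in for a file
write_rows([row], buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle such as open("cases.csv", "w", newline="", encoding="utf-8") instead of the StringIO buffer. If you already have the pandas DataFrame, its to_csv method does the same job.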

Answer 1: accepted, posted 2019-03-15 21:44:03
