HTML Scraping Using BeautifulSoup

I would like to scrape the following website, a repository of cases: https://engagements.ceres.org/?_ga=2.157917299.852607976.1552678391-697747477.1552678391

The features I intend to extract are:

'Organization', 'Industry', 'Title', 'Filed_By', 'Status', 'Year', 'Summary' (main body text)

My question is: how do I scrape each case and have the program loop through all the pages?

The URL in my code is only the first case, but I need to loop through all the pages in the repository (88 pages) and write them into a CSV.

I am wondering if using a lambda would work in this case.

Also, can someone kindly shed some light on how to understand and identify patterns in the HTML tags for future use, because I am new to this field?

The following code is what I have at this moment:

url = "https://engagements.ceres.org/ceres_engagementdetailpage?recID=a0l1H00000CDy78QAD"

page = requests.get(url, verify=False)

soup = BeautifulSoup(page.text, 'html.parser')

I think you need to combine bs with selenium, as some content is a little slower to load. You can use bs to grab the initial links and then use selenium with waits to ensure the content on each page has loaded. You also need to handle the certificate problem initially.
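As an aside on the certificate problem: because verify=False is passed to requests, every call will emit an InsecureRequestWarning. One common way to silence that warning (an optional convenience, not specific to this site) is:

import urllib3

# Suppress the InsecureRequestWarning that requests/urllib3 emits when verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)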

I am not sure what the summary is, so I provide all the p tags. This means some duplicated info; you can refine this.

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

baseUrl = 'https://engagements.ceres.org'
results = []
driver = webdriver.Chrome()

# Grab the detail-page links from the listing page with requests + BeautifulSoup
r = requests.get('https://engagements.ceres.org/?_ga=2.157917299.852607976.1552678391-697747477.1552678391', verify=False)
soup = bs(r.content, 'lxml')
items = [baseUrl + item['href'] for item in soup.select("[href*='ceres_engagementdetailpage?recID=']")]

for item in items:
    driver.get(item)
    # Wait until at least one <p> element is present before reading the page
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "p")))
    title = driver.find_element_by_css_selector('.resolutionsTitle').text
    organization = driver.find_element_by_css_selector('#description p').text
    year = driver.find_element_by_css_selector('#description p + p').text
    aList = driver.find_elements_by_css_selector('.td2')
    industry = aList[0].text
    filedBy = aList[2].text
    status = aList[5].text
    summary = [p.text for p in driver.find_elements_by_css_selector('#description p')]
    results.append([organization, industry, title, filedBy, status, year, summary])

df = pd.DataFrame(results, columns=['Organization', 'Industry', 'Title', 'Filed By', 'Status', 'Year', 'Summary'])
print(df)
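For the CSV part of the question, pandas can write the DataFrame directly (the filename here is arbitrary). Looping over all 88 listing pages depends on how the listing's pagination is implemented, which I have not inspected, so the second half below is only a sketch that continues from the code above: the '.next' selector is a placeholder for whatever the real pagination control is, and you would substitute it after inspecting the page.

# Write the scraped rows to CSV (filename is arbitrary)
df.to_csv('ceres_engagements.csv', index=False)

# Sketch only: collect detail links from every listing page by repeatedly
# clicking a pagination control. '.next' is a placeholder selector - replace it
# with the real one from the listing page, then use all_links in place of items.
all_links = []
driver.get(baseUrl)
while True:
    WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "[href*='ceres_engagementdetailpage?recID=']")))
    page_soup = bs(driver.page_source, 'lxml')
    all_links += [baseUrl + a['href'] for a in page_soup.select("[href*='ceres_engagementdetailpage?recID=']")]
    next_buttons = driver.find_elements_by_css_selector('.next')  # placeholder
    if not next_buttons:
        break
    next_buttons[0].click()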
