
HTML Scraping Using BeautifulSoup

I would like to scrape the following website, a repository of cases: https://engagements.ceres.org/?_ga=2.157917299.852607976.1552678391-697747477.1552678391

The features I intend to extract are:

'Organization', 'Industry', 'Title', 'Filed_By', 'Status', 'Year', 'Summary' (main body text)

My question is: how do I scrape each case and have the program loop through all the pages?

The URL in my code points to only the first case, but I need to loop through all the pages in the repository (88 pages) and write the results to a CSV file.

I am wondering whether using a lambda would work in this case.

Also, could someone kindly shed some light on how to understand and identify patterns in the HTML tags for future use? I am new to this field.

The following code is what I have at this moment:

import requests
from bs4 import BeautifulSoup

url = "https://engagements.ceres.org/ceres_engagementdetailpage?recID=a0l1H00000CDy78QAD"

page = requests.get(url, verify=False)
soup = BeautifulSoup(page.text, 'html.parser')

I think you need to combine BeautifulSoup with Selenium, as some of the content is a little slower to load. You can use BeautifulSoup to grab the initial links, then use Selenium with explicit waits to ensure the content on each detail page has loaded. You also need to handle the certificate problem up front.

I am not sure what exactly the summary is, so I collect all the p tags; this means some duplicated info, which you can refine.
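On the certificate problem: since verify=False is used below, urllib3 will also emit an InsecureRequestWarning on every request. One way to silence it is the standard urllib3 call:

import urllib3

# Stop urllib3 from warning on every unverified (verify=False) request
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)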

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

baseUrl = 'https://engagements.ceres.org'
results = []
driver = webdriver.Chrome()

# Grab the detail-page links from the listing page with requests/BeautifulSoup
r = requests.get('https://engagements.ceres.org/?_ga=2.157917299.852607976.1552678391-697747477.1552678391', verify=False)
soup = bs(r.content, 'lxml')
items = [baseUrl + item['href'] for item in soup.select("[href*='ceres_engagementdetailpage?recID=']")]

for item in items:
    driver.get(item)
    # Wait until at least one paragraph is present before reading the page
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "p")))
    title = driver.find_element(By.CSS_SELECTOR, '.resolutionsTitle').text
    organization = driver.find_element(By.CSS_SELECTOR, '#description p').text
    year = driver.find_element(By.CSS_SELECTOR, '#description p + p').text
    aList = driver.find_elements(By.CSS_SELECTOR, '.td2')
    industry = aList[0].text
    filedBy = aList[2].text
    status = aList[5].text
    summary = [p.text for p in driver.find_elements(By.CSS_SELECTOR, '#description p')]
    results.append([organization, industry, title, filedBy, status, year, summary])

driver.quit()
df = pd.DataFrame(results, columns=['Organization', 'Industry', 'Title', 'Filed By', 'Status', 'Year', 'Summary'])
print(df)
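On the CSV part: once the DataFrame is built, df.to_csv('ceres_engagements.csv', index=False) writes the file you asked for. The listing page only gives you the first batch of links, so to cover all 88 pages you also need to advance the paginator. Below is a minimal sketch of that loop with Selenium; note that 'a.next' is a placeholder selector for the next button, which you would confirm in your browser's developer tools. That kind of inspection, hovering elements and reading their tags, classes, and ids, is also how you learn to spot the HTML patterns you asked about (for example, [href*='...'] matches any href containing that substring, and '#description p + p' selects the paragraph immediately after the first one inside #description).

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

baseUrl = 'https://engagements.ceres.org'
link_selector = "[href*='ceres_engagementdetailpage?recID=']"

driver = webdriver.Chrome()
driver.get(baseUrl)

all_links = []
while True:
    # Wait until the case links on the current listing page have rendered
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, link_selector)))
    first_link = driver.find_element(By.CSS_SELECTOR, link_selector)
    page_soup = bs(driver.page_source, 'lxml')
    all_links += [baseUrl + a['href'] for a in page_soup.select(link_selector)]
    # 'a.next' is a hypothetical selector for the paginator's next button;
    # inspect the page to find the real one
    next_buttons = driver.find_elements(By.CSS_SELECTOR, 'a.next')
    if not next_buttons:
        break  # no next button, so the last of the 88 pages was reached
    next_buttons[0].click()
    # Wait for the old links to go stale so we don't re-read the same page
    WebDriverWait(driver, 10).until(EC.staleness_of(first_link))

driver.quit()
print(len(all_links))

With the links collected, you can feed all_links into the detail-page loop above in place of items and finish with the df.to_csv call.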
