简体   繁体   中英

Python: How can I break loop and append the last page of results?

I've made a scraper that works except that it wont scrape the last page. The url doesn't change, so I set it up to run on an infinite loop.

Ive set the loop up to break when it cant click on the next button anymore( on the last page), and it seems that the script is ending before it appends the last past of results.

How can I append the last page the list?

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
import itertools


url = "https://example.com"

driver = webdriver.Chrome(executable_path="/usr/bin/chromedriver")
driver.get(url)

inputElement = driver.find_element_by_id("txtBusinessName")
inputElement.send_keys("ship")

inputElement.send_keys(Keys.ENTER)

df2 = pd.DataFrame()

for i in itertools.count():
    element = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.ID, "grid_businessList")))
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find('table', id="grid_businessList")
    rows = table.findAll("tr")

    columns = [v.text.replace('\xa0', ' ') for v in rows[0].find_all('th')]

    df = pd.DataFrame(columns=columns)

    for i in range(1, len(rows)):
        tds = rows[i].find_all('td')

        if len(tds) == 5:
            values = [tds[0].text, tds[1].text, tds[2].text, tds[3].text, tds[4].text, tds[5].text]
        else:
            values = [td.text for td in tds]

        df = df.append(pd.Series(values, index=columns), ignore_index=True)

    try:
        next_button = driver.find_element_by_css_selector("li.next:nth-child(9) > a:nth-child(1)")
        driver.execute_script("arguments[0].click();", next_button)
        sleep(5)

    except NoSuchElementException:
        break

    df2 = df2.append(df)
    df2.to_csv(r'/home/user/Documents/test/' + 'gasostest.csv', index=False)

The problem is that except will break the loop before you have append the last page.

What you can do is to use a finally statement in your try - except statement. The code in the finally block will always run, see https://docs.python.org/3/tutorial/errors.html#defining-clean-up-actions

Your code can be rewritten to this:

    try:
        next_button = driver.find_element_by_css_selector("li.next:nth-child(9) > a:nth-child(1)")
        driver.execute_script("arguments[0].click();", next_button)
        sleep(5)

    except NoSuchElementException:
        break

    finally:
        df2 = df2.append(df)
        df2.to_csv(r'/home/user/Documents/test/' + 'gasostest.csv', index=False)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM