
How to save data from multiple pages using webdriver into a single CSV

So I'm trying to save data from Google Scholar using Selenium (webdriver). So far I can print the data that I want, but when I save it into a CSV it only saves the first page.

from selenium import webdriver
from selenium.webdriver.common.by import By
# Import statements for explicit wait
from selenium.webdriver.support.ui import WebDriverWait as W
from selenium.webdriver.support import expected_conditions as EC
import time
import csv
from csv import writer

exec_path = r"C:\Users\gvste\Desktop\proyecto\chromedriver.exe"
URL = r"https://scholar.google.com/citations?view_op=view_org&hl=en&authuser=2&org=8337597745079551909"

button_locators = ['//*[@id="gsc_authors_bottom_pag"]/div/button[2]', '//*[@id="gsc_authors_bottom_pag"]/div/button[2]','//*[@id="gsc_authors_bottom_pag"]/div/button[2]']
wait_time = 3
driver = webdriver.Chrome(executable_path=exec_path)
driver.get(URL)
wait = W(driver, wait_time)
#driver.maximize_window()
for j in range(len(button_locators)):
    button_link = wait.until(EC.element_to_be_clickable((By.XPATH, button_locators[j])))

address = driver.find_elements_by_class_name("gsc_1usr")

#for post in address:
#    print(post.text)
time.sleep(4)

with open('post.csv','a') as s:
    for i in range(len(address)):
        addresst = address
        #if addresst == 'NONE':
        #    addresst = str(address)
        #else:
        addresst = address[i].text.replace('\n',',')
        s.write(addresst + '\n')

button_link.click()
time.sleep(4)

#driver.quit()

You only get the first page's data because your program stops right after it clicks the next-page button. You have to put all of that inside a for loop.

Notice I wrote range(7) because I know there are 7 pages to open; in reality we should never do that. Imagine if we had thousands of pages. We should add some logic to check whether the "next page" button still exists (or is still enabled) and loop until it doesn't.

exec_path = r"C:\Users\gvste\Desktop\proyecto\chromedriver.exe"
URL = r"https://scholar.google.com/citations?view_op=view_org&hl=en&authuser=2&org=8337597745079551909"

button_locators = "/html/body/div/div[8]/div[2]/div/div[12]/div/button[2]"
wait_time = 3
driver = webdriver.Chrome(executable_path=exec_path)
driver.get(URL)
wait = W(driver, wait_time)

time.sleep(4)

# 7 pages. In reality, we should get this number programmatically 
for page in range(7):

    # read data from new page
    address = driver.find_elements_by_class_name("gsc_1usr")

    # write to file
    with open('post.csv','a') as s:
        for i in range(len(address)):
            addresst = address[i].text.replace('\n',',')
            s.write(addresst+ '\n')

    # find and click next page button
    button_link = wait.until(EC.element_to_be_clickable((By.XPATH, button_locators)))
    button_link.click()
    time.sleep(4)
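As a sketch of the "loop until the next-page button is gone" idea mentioned above, the pagination logic can be separated from Selenium entirely. The Selenium callbacks below are commented assumptions (in particular, that Google Scholar marks the last page by setting the button's `disabled` attribute); the loop itself runs stand-alone. This version also uses `csv.writer` (imported but unused in the question), which quotes fields correctly even when they contain commas:

```python
import csv

def scrape_all_pages(get_rows, next_page, out_path):
    """Write rows from every page to out_path as proper CSV.
    next_page() should return False when there is no further page."""
    with open(out_path, 'w', newline='') as f:
        w = csv.writer(f)
        while True:
            for row in get_rows():
                w.writerow(row)
            if not next_page():
                break

# With Selenium, the two callbacks would look roughly like this
# (assumption: Scholar disables the button on the last page):
#
# def get_rows():
#     return [[e.text.replace('\n', ' ')] for e in
#             driver.find_elements_by_class_name("gsc_1usr")]
#
# def next_page():
#     button = driver.find_element_by_xpath(button_locators)
#     if button.get_attribute("disabled"):
#         return False
#     button.click()
#     time.sleep(4)
#     return True

# Stand-alone demo with three fake pages:
pages = [[["Alice", "MIT"]], [["Bob", "CMU"]], [["Carol", "ETH"]]]
state = {"i": 0}

def fake_get_rows():
    return pages[state["i"]]

def fake_next_page():
    if state["i"] + 1 == len(pages):
        return False
    state["i"] += 1
    return True

scrape_all_pages(fake_get_rows, fake_next_page, "post.csv")
print(open("post.csv").read())
```

With this structure, adding or removing pages on the site requires no change to the script, and there is no magic number 7 to maintain.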

Also, in the future you should look to change all these time.sleep calls to wait.until, because sometimes the page loads quicker and the program could do its job faster. Or, even worse, your network might lag, and that would break your script.
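Concretely, wait.until just polls a condition until it returns something truthy or a timeout expires, which is why it is both faster and more robust than a fixed sleep. A pure-Python sketch of that mechanism (Selenium's WebDriverWait does the same thing, polling every 0.5 s by default):

```python
import time

def until(condition, timeout=3.0, poll=0.05):
    """Poll condition() until it returns a truthy value, or raise
    TimeoutError once the timeout expires -- the core of wait.until."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within timeout")

# In the script above, instead of time.sleep(4) one could write:
#
# address = wait.until(
#     EC.presence_of_all_elements_located((By.CLASS_NAME, "gsc_1usr")))
#
# which returns as soon as the elements appear.

# Stand-alone demo: the condition becomes true on the third poll.
state = {"n": 0}

def ready():
    state["n"] += 1
    return state["n"] >= 3

print(until(ready))
```

The fixed sleep always costs its full duration; the polling wait costs only as long as the page actually takes, and fails loudly instead of silently scraping a half-loaded page.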

Note: technical posts on this site are licensed under CC BY-SA 4.0; please credit the original source when reposting. © 2020-2024 STACKOOM.COM