How can I open URLs from a CSV file with selenium?

I'm trying to save the data from a Google Scholar profile into a CSV. The profile has a 'Show More' button, and I can get all the data behind it (here I only save the data from the table, but I need all the data from the profile). The problem is that the data gets saved twice, sometimes even more. I think that's because I save it while I'm still clicking, and not after I have clicked through every 'Show More'. How can I do that? Also, here I used only one URL, but there are more, and I have them saved in another CSV. How do I open each URL from that file and do what I do here? (I only need the Link row.) The CSV with the URLs looks like this: https://drive.google.com/file/d/1zkTlzYaOQ7FVoSdd5OMnE8QgwS8NOik7/view?usp=sharing

from selenium.webdriver.support.ui import WebDriverWait as W
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common import exceptions as SE
from selenium import webdriver
import time
from csv import writer

chrome_path=r"C:\Users\gvste\Desktop\proyecto\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)

urls = ["https://scholar.google.com/citations?hl=en&user=gQb_tFMAAAAJ"]

button_locators = "//button[@class='gs_btnPD gs_in_ib gs_btn_flat gs_btn_lrge gs_btn_lsu']"
wait_time = 2

wait = W(driver, wait_time)

for url in urls:
    data = {}
    driver.get(url)

    button_link = wait.until(EC.element_to_be_clickable((By.XPATH, button_locators)))

    while button_link:
        try:
            wait.until(EC.visibility_of_element_located((By.ID,'gsc_a_tw')))
            data = driver.find_elements_by_class_name("gsc_a_tr")



            button_link = wait.until(EC.element_to_be_clickable((By.XPATH, button_locators)))
            button_link.click()
            time.sleep(2)

            with open('perfil.csv','a', encoding="utf-8", newline='') as s:
                 csv_writer =writer(s)
                 for i in range(len(data)):
                     paper = driver.find_elements_by_class_name("gsc_a_t")
                     citas = driver.find_elements_by_class_name("gsc_a_c")
                     año = driver.find_elements_by_class_name("gsc_a_y")  
                     p = paper[i].text.replace(',', '')
                     c = citas[i].text.replace(',', '')
                     a = año[i].text.replace(',', '')            
                     csv_writer.writerow([ p, c, a])

        except SE.TimeoutException:
            print(f'Página parseada {url}')
            break

driver.quit()
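
A note on the duplicate rows: in the code above the `with open(...)` block sits inside the `while` loop, so every click re-writes all rows that were loaded before that click, and the rows revealed by the last click are never written. A minimal sketch of separating the two phases (click until nothing more loads, then read once) is below. `expand_then_collect`, `click_more`, and `read_rows` are hypothetical names, not Selenium APIs; in the question's code `click_more` would wrap the `wait.until(...)` / `button_link.click()` step and return False on `TimeoutException`, and `read_rows` would wrap the `find_elements` calls.

```python
from typing import Callable, List, Sequence


def expand_then_collect(
    click_more: Callable[[], bool],
    read_rows: Callable[[], Sequence[Sequence[str]]],
    max_clicks: int = 100,
) -> List[List[str]]:
    """Click until nothing more loads, then read the table exactly once."""
    for _ in range(max_clicks):
        # click_more() is a hypothetical callback: it clicks the button and
        # returns False once the button can no longer be clicked.
        if not click_more():
            break
    # Only now, with the table fully expanded, read every row a single time.
    return [list(row) for row in read_rows()]


if __name__ == "__main__":
    # Stand-in for the browser: a fake page that reveals 2 more rows per click.
    all_rows = [["Paper A", "10", "2019"], ["Paper B", "3", "2020"],
                ["Paper C", "7", "2021"], ["Paper D", "1", "2022"]]
    state = {"visible": 2}

    def fake_click() -> bool:
        if state["visible"] >= len(all_rows):
            return False
        state["visible"] += 2
        return True

    def fake_read():
        return all_rows[: state["visible"]]

    for row in expand_then_collect(fake_click, fake_read):
        print(row)
```

The returned list can then be written to the CSV in one pass after the loop, which is what avoids the repeated rows. The `max_clicks` cap is only a safety net against a button that never stops being clickable.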

I didn't really follow what is happening in the first part. For the second part, you can stop hard-coding the URLs: put the loop in a function and read the URLs from the CSV with the pandas library (it's much better for CSV work). This gets the URLs:

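import pandas as pd

df = pd.read_csv(csv_file)   # csv_file: path to the CSV that holds the URLs
urls = df['column_name']     # replace 'column_name' with the header of the URL column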
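
The "put the loop in a function" part could look roughly like the sketch below. `scrape_all` and `scrape_one` are hypothetical names: `scrape_one` stands for whatever does the per-profile work from the question (open the URL, click, collect rows). The filtering rule (keep only values that start with `http`) is an assumption meant to drop blank cells and pandas `NaN` values.

```python
from typing import Callable, Dict, Iterable, List


def scrape_all(
    urls: Iterable[object],
    scrape_one: Callable[[str], List[list]],
) -> Dict[str, List[list]]:
    """Run scrape_one on every distinct, non-empty URL and keep the results."""
    results: Dict[str, List[list]] = {}
    for raw in urls:
        url = str(raw).strip()
        # Skip blanks, NaN cells (str(nan) == "nan"), and URLs already done.
        if not url.startswith("http") or url in results:
            continue
        results[url] = scrape_one(url)
    return results


if __name__ == "__main__":
    # Placeholder scraper so the sketch runs without a browser.
    def fake_scrape(url: str) -> List[list]:
        return [[url, "some row"]]

    print(scrape_all(["https://example.com/a", "", "https://example.com/a"], fake_scrape))
```

With the pandas snippet above, the call would be something like `scrape_all(urls, my_scraper)`, since iterating over a pandas Series yields its values.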

Here is the most basic way to read data from a CSV file:

import csv

with open('filename.csv', 'r', newline='') as file:
    reader = csv.reader(file)  # pass the open file object, not the file name
    for row in reader:
        print(row)
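
If only one column is needed, `csv.DictReader` lets you pick it by header name. The sketch below assumes the CSV has a header row and that the URL column is called `Link` (the name mentioned in the question; adjust it if the file differs). `read_links` is a hypothetical helper name.

```python
import csv
from typing import List


def read_links(path: str, column: str = "Link") -> List[str]:
    """Return the non-empty values of one column from a CSV with a header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Fail early if the assumed column name is not in the header.
        if column not in (reader.fieldnames or []):
            raise KeyError(f"column {column!r} not found in {path}")
        # Short rows give None for missing cells, so fall back to "".
        return [
            row[column].strip()
            for row in reader
            if (row.get(column) or "").strip()
        ]
```

The resulting list can replace the hard-coded `urls` list in the question, for example `urls = read_links('filename.csv')`.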
