
Using Python to read data from a CSV file as input and write the output to a CSV file

I have a CSV file with three columns: Year, Title and Author. For example:

Year,Title,Author
2018,Becoming,Michelle Obama
2018,Educated,Tara Westover
2018,Grant,Ron Chernow

I want to add two more columns, one for word count and one for page count.

I have written the following script which opens a web page, searches for the book and extracts word count and page count information.

import time
from selenium import webdriver

# chromedriver (path to the ChromeDriver executable) and title (the book to
# search for) are assumed to be defined elsewhere.
driver = webdriver.Chrome(chromedriver)
driver.get('https://www.readinglength.com/')
driver.maximize_window()
driver.implicitly_wait(10)
time.sleep(5)
search_box = driver.find_element_by_id("downshift-0-input")
search_box.send_keys(title)
search_box.submit()
driver.implicitly_wait(10)
word_count = driver.find_element_by_xpath("//div[@class='book-data']//div[2]").text
page_count = driver.find_element_by_xpath("//div[@class='book-data']//div[4]").text
print(word_count)
print(page_count)
time.sleep(5)
driver.quit()

I would like to do the following:

Get the title from the csv file and input it into the search. Extract the word count and page count information and add it to the respective row and column. Repeat for every title/row in the csv.

Any help would be greatly appreciated!

In Python, a convenient way to work with CSV files is the pandas package. pandas has a function for reading a CSV file: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html. Once the file is loaded, it is represented as a DataFrame, a tabular data type that lets you filter rows, modify values and add columns. For adding columns, see for example https://www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/
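A minimal sketch of that read, add columns, write round trip. The sample rows are the ones from the question; the io.StringIO objects are only stand-ins for real file paths, and the column names word_count and page_count are assumptions, not something the question fixes.

```python
import io

import pandas as pd

# Stand-in for the real CSV file; pd.read_csv also accepts a file path.
csv_text = """Year,Title,Author
2018,Becoming,Michelle Obama
2018,Educated,Tara Westover
2018,Grant,Ron Chernow
"""

df = pd.read_csv(io.StringIO(csv_text))

# New columns are created by plain assignment. They start out empty here
# and would be filled in by the scraping step.
df["word_count"] = None
df["page_count"] = None

# Write the result back out. index=False keeps the DataFrame's row index
# out of the file, so only the original and the new columns are written.
out = io.StringIO()
df.to_csv(out, index=False)
print(out.getvalue())
```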

Alternatively, you can read the CSV file with the csv module from Python's standard library. A short tutorial is here: https://realpython.com/python-csv/
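With the csv module, the same task could be sketched roughly as below. The names add_counts and lookup_counts are hypothetical: lookup_counts is only a placeholder for the Selenium search from the question, and the output column names are assumptions.

```python
import csv


def lookup_counts(title):
    """Hypothetical placeholder for the Selenium lookup in the question.

    It returns empty strings; swap in the real scraping code here.
    """
    return "", ""


def add_counts(input_path, output_path, lookup=lookup_counts):
    """Copy a CSV file row by row, appending word_count and page_count."""
    with open(input_path, newline="", encoding="utf-8") as src, \
            open(output_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        # Keep the original columns and append the two new ones.
        fieldnames = list(reader.fieldnames or []) + ["word_count", "page_count"]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            # The header in the sample file is "Title" (capitalised).
            row["word_count"], row["page_count"] = lookup(row["Title"])
            writer.writerow(row)
```

Each row is written as soon as its lookup finishes, so the whole file never has to be held in memory.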

I hope this is going to help you :)

Something like this should work. Please amend as needed.

import time

import pandas as pd
from selenium import webdriver


def web_search(title: str):
    # chromedriver is assumed to hold the path to the ChromeDriver executable
    driver = webdriver.Chrome(chromedriver)
    driver.get('https://www.readinglength.com/')
    driver.maximize_window()
    driver.implicitly_wait(10)
    time.sleep(5)
    search_box = driver.find_element_by_id("downshift-0-input")
    search_box.send_keys(title)
    search_box.submit()
    driver.implicitly_wait(10)
    word_count = driver.find_element_by_xpath("//div[@class='book-data']//div[2]").text
    page_count = driver.find_element_by_xpath("//div[@class='book-data']//div[4]").text
    print(word_count)
    print(page_count)
    time.sleep(5)
    driver.quit()
    return word_count, page_count


# file is assumed to hold the path to the input CSV
df = pd.read_csv(file)

for index, row in df.iterrows():
    # The column header in the CSV is "Title" (capitalised)
    print("Retrieving " + str(row['Title']))
    word_count, page_count = web_search(row['Title'])
    df.loc[index, 'word_count'] = word_count
    df.loc[index, 'page_count'] = page_count

# index=False keeps the DataFrame's row index out of the output file
df.to_csv('newfile.csv', index=False)

Using the pandas package seems the most convenient way of doing this. pandas provides the DataFrame class, which has methods for reading and writing CSV files, and an apply method for creating new columns from the values of other columns. Your use case would look something like this (I did not test your code, just pasted it into the fetch_data() function):

import time

import pandas as pd
from selenium import webdriver


def fetch_data(title):
    # chromedriver is assumed to hold the path to the ChromeDriver executable
    driver = webdriver.Chrome(chromedriver)
    driver.get('https://www.readinglength.com/')
    driver.maximize_window()
    driver.implicitly_wait(10)
    time.sleep(5)
    search_box = driver.find_element_by_id("downshift-0-input")
    search_box.send_keys(title)
    search_box.submit()
    driver.implicitly_wait(10)
    word_count = driver.find_element_by_xpath("//div[@class='book-data']//div[2]").text
    page_count = driver.find_element_by_xpath("//div[@class='book-data']//div[4]").text
    time.sleep(5)
    driver.quit()

    return word_count, page_count


def process_file(input_file_path, output_file_path):
    df = pd.read_csv(input_file_path)
    # fetch_data returns a (word_count, page_count) tuple for each title;
    # apply(pd.Series) splits those tuples into two columns.
    # The column header in the CSV is "Title" (capitalised).
    df[['word_count', 'page_count']] = df['Title'].apply(fetch_data).apply(pd.Series)

    # index=False keeps the DataFrame's row index out of the output file
    df.to_csv(output_file_path, index=False)

The main advantage of pandas, performing operations on DataFrames quickly, is pretty much irrelevant in your case, because fetching and parsing the web pages takes far more time. Still, I'd say that doing it this way with pandas is a convenient, concise and readable way of writing the code.
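To see what the apply(...).apply(pd.Series) step produces without launching a browser, a stub can stand in for fetch_data. This is only a sketch: fake_fetch is a hypothetical function, and the numbers it returns are derived from the title length purely for illustration; they are not real word or page counts.

```python
import pandas as pd


def fake_fetch(title):
    # Hypothetical stub replacing the Selenium lookup. The values are
    # made up from the title length and are not real book data.
    return len(title) * 1000, len(title) * 4


df = pd.DataFrame({"Title": ["Becoming", "Educated", "Grant"]})

# Step 1: apply the function to every title, giving a Series of tuples.
pairs = df["Title"].apply(fake_fetch)

# Step 2: expand each tuple into its own columns (labelled 0 and 1).
expanded = pairs.apply(pd.Series)

# Step 3: assign the two expanded columns to the new column names.
df[["word_count", "page_count"]] = expanded

print(df)
```

Testing the DataFrame logic against a stub like this keeps it separate from the slow scraping step, which can then be swapped back in.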
