I have a CSV file with the following columns: Year, Title, Author. For example:
Year,Title,Author
2018,Becoming,Michelle Obama
2018,Educated,Tara Westover
2018,Grant,Ron Chernow
I want to add two more columns, one for word count and one for page count.
I have written the following script which opens a web page, searches for the book and extracts word count and page count information.
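import time
from selenium import webdriver

# `chromedriver` (path to the ChromeDriver executable) and `title` (the book to search for) are defined elsewhere.
# Note: find_element_by_id / find_element_by_xpath are the Selenium 3 API; Selenium 4 uses driver.find_element(By.ID, ...) instead.
driver = webdriver.Chrome(chromedriver)
driver.get('https://www.readinglength.com/')
driver.maximize_window()
driver.implicitly_wait(10)
time.sleep(5)
search_box = driver.find_element_by_id("downshift-0-input")
search_box.send_keys(title)
search_box.submit()
driver.implicitly_wait(10)
word_count = driver.find_element_by_xpath("//div[@class='book-data']//div[2]").text
page_count = driver.find_element_by_xpath("//div[@class='book-data']//div[4]").text
print(word_count)
print(page_count)
time.sleep(5)
driver.quit()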
I would like to do the following:
Get the title from the csv file and input it into the search. Extract the word count and page count information and add it to the respective row and column. Repeat for every title/row in the csv.
Any help would be greatly appreciated!
In Python, the best way to handle CSV files is to use a package called pandas. pandas has a function to read a CSV file: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html From there on, you can do a lot with your data (in pandas it is represented as a special data type called DataFrame). See, for example, https://www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/ for how to add columns.
Of course, you can also read the CSV file with a different package: the standard library's csv module. A short tutorial is here: https://realpython.com/python-csv/
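To illustrate the csv route, here is a minimal sketch using only the standard library. The function name add_columns and the stand-in fake_lookup are made up for this example; in practice the lookup would be the Selenium search from the question. It assumes the header contains a Title column, as in the sample data.

```python
import csv
import io


def add_columns(src, dst, lookup):
    """Copy CSV rows from src to dst, adding word_count and page_count.

    `lookup` is any callable that takes a title and returns a
    (word_count, page_count) tuple, e.g. the Selenium search.
    """
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames) + ["word_count", "page_count"]
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["word_count"], row["page_count"] = lookup(row["Title"])
        writer.writerow(row)


def fake_lookup(title):
    # Hypothetical stand-in so the sketch runs without a browser;
    # the returned values are placeholders, not real data.
    return "n/a", "n/a"


source = io.StringIO(
    "Year,Title,Author\n"
    "2018,Becoming,Michelle Obama\n"
    "2018,Educated,Tara Westover\n"
)
result = io.StringIO()
add_columns(source, result, fake_lookup)
print(result.getvalue())
```

With real files you would pass file objects opened with open(path, newline="") instead of the StringIO buffers.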
I hope this is going to help you :)
Something like this should work. Please amend as needed.
import time

import pandas as pd
from selenium import webdriver


def web_search(title: str):
    # `chromedriver` must hold the path to the ChromeDriver executable.
    driver = webdriver.Chrome(chromedriver)
    driver.get('https://www.readinglength.com/')
    driver.maximize_window()
    driver.implicitly_wait(10)
    time.sleep(5)
    search_box = driver.find_element_by_id("downshift-0-input")
    search_box.send_keys(title)
    search_box.submit()
    driver.implicitly_wait(10)
    word_count = driver.find_element_by_xpath("//div[@class='book-data']//div[2]").text
    page_count = driver.find_element_by_xpath("//div[@class='book-data']//div[4]").text
    print(word_count)
    print(page_count)
    time.sleep(5)
    driver.quit()
    return word_count, page_count


# `file` is the path to the input CSV (columns: Year, Title, Author).
df = pd.read_csv(file)
for index, row in df.iterrows():
    # The column is named 'Title' in the CSV header, so index by that name.
    print("Retrieving " + str(row['Title']))
    word_count, page_count = web_search(row['Title'])
    df.loc[index, 'word_count'] = word_count
    df.loc[index, 'page_count'] = page_count
# index=False keeps the pandas row index out of the output file.
df.to_csv('newfile.csv', index=False)
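One practical concern with a loop like this: if a title is not found on the site, the Selenium lookup raises an exception and the whole run stops. Below is a rough sketch of guarding each lookup. The names enrich and demo_lookup are invented for this example, plain dicts stand in for the DataFrame rows, and demo_lookup stands in for web_search so the snippet runs without a browser.

```python
def enrich(rows, lookup):
    """Add word_count/page_count to each row dict; leave blanks on failure.

    Returns a list of (title, error message) pairs for lookups that failed.
    """
    failed = []
    for row in rows:
        try:
            row["word_count"], row["page_count"] = lookup(row["Title"])
        except Exception as exc:  # e.g. Selenium's NoSuchElementException
            row["word_count"], row["page_count"] = "", ""
            failed.append((row["Title"], str(exc)))
    return failed


def demo_lookup(title):
    # Hypothetical stand-in for web_search; the values are placeholders.
    if title == "Unknown Book":
        raise LookupError("no result")
    return "wc", "pc"


rows = [{"Title": "Becoming"}, {"Title": "Unknown Book"}]
failed = enrich(rows, demo_lookup)
print(rows)
print(failed)
```

The same try/except can be wrapped around the web_search call inside the iterrows loop, so the rows that did succeed still get written to the output file.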
Using the pandas package seems the most convenient way of doing this. pandas provides the DataFrame class, which has nice methods to read and write CSV, and also an apply method with which we can create new columns based on values of other columns. Your use case would look something like this (I did not test your code, just pasted it into the fetch_data() function):
import time

import pandas as pd
from selenium import webdriver


def fetch_data(title):
    # `chromedriver` must hold the path to the ChromeDriver executable.
    driver = webdriver.Chrome(chromedriver)
    driver.get('https://www.readinglength.com/')
    driver.maximize_window()
    driver.implicitly_wait(10)
    time.sleep(5)
    search_box = driver.find_element_by_id("downshift-0-input")
    search_box.send_keys(title)
    search_box.submit()
    driver.implicitly_wait(10)
    word_count = driver.find_element_by_xpath("//div[@class='book-data']//div[2]").text
    page_count = driver.find_element_by_xpath("//div[@class='book-data']//div[4]").text
    time.sleep(5)
    driver.quit()
    return word_count, page_count


def process_file(input_file_path, output_file_path):
    # pandas is imported as pd, so call pd.read_csv.
    df = pd.read_csv(input_file_path)
    # The CSV header names the column 'Title'; each (word_count, page_count)
    # tuple returned by fetch_data is expanded into two new columns.
    df[['word_count', 'page_count']] = df['Title'].apply(fetch_data).apply(pd.Series)
    # index=False keeps the pandas row index out of the output file.
    df.to_csv(output_file_path, index=False)
The main advantage of pandas (performing operations on DataFrames quickly) is pretty much irrelevant in your case, because the web scraping is far more time-consuming. Still, I'd say doing it this way with pandas is a very convenient, concise and readable way of writing the code.
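Since the browser lookup dominates the runtime, one cheap saving is to avoid searching for the same title twice within a run. This is only a sketch and assumes the file may contain repeated titles. The name cached_lookup is made up, and its body is a placeholder where the real search (fetch_data above) would go.

```python
from functools import lru_cache

calls = []  # records the real lookups, only to show the cache working


@lru_cache(maxsize=None)
def cached_lookup(title):
    calls.append(title)
    # Placeholder for the slow browser search; the values are not real data.
    return ("words for " + title, "pages for " + title)


titles = ["Becoming", "Educated", "Becoming"]
results = [cached_lookup(t) for t in titles]
print(len(calls), "lookups for", len(titles), "rows")
```

Note that lru_cache only lives for the duration of one run; it does not remember results between runs of the script.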