So far I have: a Python script that can call ChromeDriver, enter a single URL, and pull the results out of a PageSpeed reading.
What I am looking to do: create a loop that takes multiple URLs from an Excel file one at a time, loads a PageSpeed test, pulls the results, and repeats the process until all the URLs have been read.
from selenium import webdriver
import time
import pandas as pd
dataSheet = pd.read_excel("URL_Test_File.xlsx")
df = pd.DataFrame()
pageSpeed = []
for data in dataSheet:
    armyURL = dataSheet['URLs']
    browser = webdriver.Chrome('C:\\Webdriver\\chromedriver')
    browser.get(('https://developers.google.com/speed/pagespeed/insights/'))
    time.sleep(3)
    searchBar = browser.find_element_by_name('url')
    searchBar.send_keys(armyURL)
    searchBar.send_keys(u'\ue007')
    time.sleep(7)
    scoreCard = browser.find_element_by_class_name('speed-report-card-score')
    df["Speed Results"] = scoreCard
    clearBar = browser.find_element_by_name('url')
    clearBar.clear()
(I am relatively new to coding so I know that things are a little sloppy at the moment)
Since you haven't given a link to your Excel file, I have created one with the same column name as yours.
You can download it from here: https://drive.google.com/open?id=1eelHqJcnNdKNIDYL7NIgwwdNsUEFqL4U
In case the file is deleted in the future, its contents are as follows:
dataSheet = pd.read_excel("URL_Test_File.xlsx")
print(dataSheet)
Output:
           URLs
0     yahoo.com
1  facebook.com
2    google.com
The mistakes you have made:
First mistake: for data in dataSheet only yields the column names. Try this:
for data in dataSheet:
    print(data)
The output will be:
URLs
To iterate through the URLs column of the Excel sheet, you need to do this:
for armyURL in dataSheet['URLs']:
    print(armyURL)
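The difference is easy to check without an Excel file. The following is a minimal sketch that assumes an in-memory DataFrame with the same URLs column as the sample sheet above; the names url_frame, column_names, and url_values are made up for this illustration.

```python
import pandas as pd

# Stand-in for pd.read_excel("URL_Test_File.xlsx"): same column, same values
url_frame = pd.DataFrame({"URLs": ["yahoo.com", "facebook.com", "google.com"]})

# Iterating over the DataFrame itself yields the column labels
column_names = [name for name in url_frame]

# Iterating over one column (a Series) yields the cell values
url_values = [url for url in url_frame["URLs"]]

print(column_names)  # ['URLs']
print(url_values)    # ['yahoo.com', 'facebook.com', 'google.com']
```

Only the second loop gives one URL per iteration, which is what the Selenium code needs.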
Second mistake: this can't really be considered a mistake, but since you want to analyze all the sites in the same tab, you need to declare browser before the for loop. If you declare browser inside the for loop, it will open a new browser window for every URL, so clearing the URL search bar is of no use.
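To see why the placement matters, here is a minimal sketch that counts how many windows get opened in each case. FakeBrowser is a made-up stand-in for webdriver.Chrome so that the example runs without Chrome or Selenium installed; only the position of the constructor call relative to the loop is the point.

```python
class FakeBrowser:
    """Hypothetical stand-in for webdriver.Chrome; it only counts windows."""
    windows_opened = 0

    def __init__(self):
        FakeBrowser.windows_opened += 1

    def get(self, url):
        pass  # a real driver would load the page here

    def quit(self):
        pass  # a real driver would close the window here


urls = ["yahoo.com", "facebook.com", "google.com"]

# Browser created inside the loop: one new window per URL
FakeBrowser.windows_opened = 0
for url in urls:
    browser = FakeBrowser()
    browser.get(url)
    browser.quit()
opened_inside = FakeBrowser.windows_opened

# Browser created before the loop: a single window reused for every URL
FakeBrowser.windows_opened = 0
browser = FakeBrowser()
try:
    for url in urls:
        browser.get(url)
finally:
    browser.quit()  # always release the window, even if a page fails
opened_before = FakeBrowser.windows_opened

print(opened_inside, opened_before)  # 3 1
```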
Third mistake: df["Speed Results"] = scoreCard won't add anything to your data frame. Try this:
df = pd.DataFrame()
for i in range(3):
    df["Speed Results"] = i
print(df)
The output will show just the column name, with no rows:
Speed Results
You need to use an indexer such as loc to insert values into a DataFrame (iloc only works on rows that already exist); both are worth looking up in the pandas documentation. I have used loc for the solution. You need to pass the row number to enter values into the DataFrame, so I have initialized a variable i=0 before the for loop to keep count of rows and incremented it by 1 at the end of the loop. Try this:
df = pd.DataFrame()
df["Speed Results"] = ""
# You can also specify the columns in the DataFrame declaration, like:
# df = pd.DataFrame(index=None, columns=["Speed Results"])
for i in range(3):
    df.loc[i] = i
print(df)
Output:
   Speed Results
0              0
1              1
2              2
Fourth mistake: since you want to add the score, which is text, to your data frame, you need to use the element's text attribute.
scoreCard = browser.find_element_by_class_name('speed-report-card-score')
df.loc[i]= scoreCard.text
What you should have added:
Sometimes the browser takes time to load elements, and if Selenium searches for an element that isn't loaded yet, it may raise an error. So use WebDriverWait to make Selenium wait for the element to be loaded.
I have added a while loop which waits until the score card is loaded.
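The idea behind such an explicit wait can be sketched in plain Python: poll a condition until it returns something truthy or a deadline passes. This is only an illustration of the pattern, not Selenium's implementation; wait_until, its parameters, and score_is_loaded are hypothetical names.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Example: a "page" whose score only becomes available on the third check
attempts = {"count": 0}


def score_is_loaded():
    attempts["count"] += 1
    return "1.2s FCP" if attempts["count"] >= 3 else None


score = wait_until(score_is_loaded, timeout=2.0, poll_interval=0.01)
print(score, attempts["count"])  # 1.2s FCP 3
```

In Selenium, WebDriverWait(browser, 10).until(...) plays this role and raises TimeoutException when the element never appears.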
Full code:
import pandas as pd
from selenium import webdriver
from time import sleep
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

# This uses the Selenium 3 API; newer Selenium releases replace
# find_element_by_* with find_element(By.<strategy>, value).
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("start-maximized")
cpath = "C:/Users/Downloads/chromedriver_win32/chromedriver.exe"
dataSheet = pd.read_excel("C:/Users/Downloads/URL_Test_File.xlsx")
df = pd.DataFrame(index=None, columns=["Speed Results"])
#df["Speed Results"]=""
# Open the browser once, before the loop, so every URL is analyzed in the same window
browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=cpath)
#browser = webdriver.Chrome('C:\\Webdriver\\chromedriver')
i = 0
for armyURL in dataSheet['URLs']:
    browser.get('https://developers.google.com/speed/pagespeed/insights/')
    sleep(3)
    searchBar = browser.find_element_by_name('url')
    searchBar.send_keys(armyURL)
    searchBar.send_keys(Keys.RETURN)
    sleep(7)
    # Keep waiting, 10 seconds at a time, until the score card is present
    while True:
        try:
            WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'speed-report-card-score')))
            break
        except TimeoutException:
            pass
    scoreCard = browser.find_element_by_class_name('speed-report-card-score')
    #scoreCard=browser.find_element_by_xpath('//div[@class="speed-report"]/div[@class="speed-report-card left"]/p[@class="speed-report-card-score"]/span[@class="fast"]')
    df.loc[i] = scoreCard.text  # store the score text in row i
    clearBar = browser.find_element_by_name('url')
    clearBar.clear()
    i += 1
browser.quit()  # close the browser once all URLs have been processed
print(df)
Output:
      Speed Results
0  1.2s FCP2.2s DCL
1  1.7s FCP3.1s DCL
2  0.7s FCP0.7s DCL
Assuming you're getting your data in from the Excel sheet and the parsing is occurring correctly, this new code should do what you want. You need to either append data to your df, or use something like I have here: the pd.DataFrame.from_dict() function, which creates the data frame from a dictionary of your data:
from selenium import webdriver
import time
import pandas as pd
dataSheet = pd.read_excel("URL_Test_File.xlsx")
#df = pd.DataFrame() # We will create the df at the end
pageSpeed = []
url_list = [] # Create a list to collect your URLs as you iterate
for armyURL in dataSheet['URLs']: # Iterate over the URL values, not the column names
    browser = webdriver.Chrome('C:\\Webdriver\\chromedriver')
    browser.get('https://developers.google.com/speed/pagespeed/insights/')
    time.sleep(3)
    searchBar = browser.find_element_by_name('url')
    searchBar.send_keys(armyURL)
    searchBar.send_keys(u'\ue007')
    time.sleep(7)
    scoreCard = browser.find_element_by_class_name('speed-report-card-score')
    pageSpeed.append(scoreCard.text) # Add the speed text (not the WebElement itself) to your pageSpeed[] list
    url_list.append(armyURL) # Add the URL data to your url_list[] list
    clearBar = browser.find_element_by_name('url')
    clearBar.clear()
    browser.quit() # Close the browser since we'll open a new one up the next time (and we should always have a .quit() at the end of our Selenium code)
speed_test_dict = {'Pages': url_list, 'Page Speed': pageSpeed}
df = pd.DataFrame.from_dict(speed_test_dict)
Since I don't have your Excel file, I can't fully test this, but it should work (or I will edit/modify it if you have any issues).
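To check the DataFrame-building step on its own, without a browser, here is a small sketch that feeds pd.DataFrame.from_dict() two hand-filled lists. The URLs and score strings are copied from the sample outputs earlier on this page rather than freshly measured.

```python
import pandas as pd

url_list = ["yahoo.com", "facebook.com", "google.com"]
pageSpeed = ["1.2s FCP2.2s DCL", "1.7s FCP3.1s DCL", "0.7s FCP0.7s DCL"]

# Each dictionary key becomes a column; the lists must have the same length
speed_test_dict = {"Pages": url_list, "Page Speed": pageSpeed}
df = pd.DataFrame.from_dict(speed_test_dict)

print(df.shape)          # (3, 2)
print(list(df.columns))  # ['Pages', 'Page Speed']
```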
Are you looking for something like this?
...
# add the right number of columns based on the number of elements in
# scoreCard_list (see below)
result = pd.DataFrame(columns=["column a", "column b"])
counter = 0
for data in dataSheet["URLs"]:
    counter += 1
    ...
    # str.split() takes a literal separator, not a regex; with no argument it splits on whitespace
    scoreCard_list = scoreCard.text.split() # or choose other delimiter to split on
    result.loc[counter] = scoreCard_list
...
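Since the score text in the output shown earlier looks like "1.2s FCP2.2s DCL", splitting on whitespace alone would leave FCP glued to the second number. A regular expression is one way around that. This is a sketch that assumes the text always has the form <seconds>s FCP<seconds>s DCL; SCORE_PATTERN and parse_score are hypothetical names.

```python
import re

# Matches e.g. "1.2s FCP2.2s DCL", with or without whitespace between the parts
SCORE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)s\s*FCP\s*(\d+(?:\.\d+)?)s\s*DCL")


def parse_score(score_text):
    """Return (fcp_seconds, dcl_seconds) as floats, or None if the text does not match."""
    match = SCORE_PATTERN.search(score_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(parse_score("1.2s FCP2.2s DCL"))  # (1.2, 2.2)
print(parse_score("no score here"))     # None
```

The two floats can then go into separate columns, for example result.loc[counter] = parse_score(scoreCard.text) when result has exactly two columns and the text matched.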
Update:
I realized that there were more flaws in my initial code than expected, especially referencing the whole data frame inside a loop that was itself iterating over that data frame. This is what I ultimately wrote to get this loop to work (thanks Leo and dblclik for looking over this).
from selenium import webdriver
import time
import pandas as pd
dataSheet = pd.read_excel("URL_Test_File.xlsx") # test file has a column labeled URLs
df = pd.DataFrame()
pageSpeed = []
browser = webdriver.Chrome('C:\\Webdriver\\chromedriver')
browser.get('https://developers.google.com/speed/pagespeed/insights/')
time.sleep(3)
for i in dataSheet["URLs"]: # Specifying the exact part of the data frame to iterate over
    enterURL = i
    searchBar = browser.find_element_by_name('url')
    searchBar.send_keys(enterURL) # use the variable assigned above (armyURL is not defined here)
    searchBar.send_keys(u'\ue007')
    time.sleep(7)
    scoreCard = browser.find_element_by_class_name('speed-report-card-score')
    # Note: on an empty DataFrame this scalar assignment adds no rows;
    # the df.loc approach described in the first answer appends them
    df["Speed Results"] = scoreCard.text
    clearBar = browser.find_element_by_name('url')
    clearBar.clear()
In using this there are still some issues when it comes to accurate information gathering and appending that still need to be troubleshot, but for those with the same issue in iterating through a dataframe with Selenium, this should be a halfway decent start.