
Calling on cells in dataframe using Selenium - iterating through a dataframe to write into a website search bar

So far I have: a Python script that launches ChromeDriver, enters a single URL, and reads the results from the page speed test.

What I am looking to do: create a loop that takes multiple URLs from an Excel file one at a time, runs a page speed test on each, pulls the results, and repeats until all the URLs have been processed.

from selenium import webdriver
import time
import pandas as pd

dataSheet = pd.read_excel("URL_Test_File.xlsx")
df = pd.DataFrame()
pageSpeed = []

for data in dataSheet:
    armyURL = dataSheet['URLs']
    browser = webdriver.Chrome('C:\\Webdriver\\chromedriver')
    browser.get(('https://developers.google.com/speed/pagespeed/insights/'))
    time.sleep(3)
    searchBar = browser.find_element_by_name('url')
    searchBar.send_keys(armyURL)
    searchBar.send_keys(u'\ue007')
    time.sleep(7)
    scoreCard = browser.find_element_by_class_name('speed-report-card-score')
    df["Speed Results"] = scoreCard
    clearBar = browser.find_element_by_name('url')
    clearBar.clear()

(I am relatively new to coding so I know that things are a little sloppy at the moment)

Since you haven't shared your Excel file, I created one with the same column name as yours.

You can download it from here: https://drive.google.com/open?id=1eelHqJcnNdKNIDYL7NIgwwdNsUEFqL4U

In case the file gets deleted in the future, its contents are as follows:

dataSheet = pd.read_excel("URL_Test_File.xlsx")
print(dataSheet)

Output:

           URLs
0     yahoo.com
1  facebook.com
2    google.com

The mistakes you have made:

First mistake:

for data in dataSheet

only iterates over the column names. Try this:

for data in dataSheet:
    print(data)

OUTPUT will be:

URLs

To iterate through the URLs column of the Excel sheet, do this:

for armyURL in dataSheet['URLs']:
    print(armyURL)

Second mistake: This isn't strictly a mistake, but since you want to analyze all the sites in the same tab, you need to create the browser before the for loop. If you create it inside the loop, a new browser window opens for every URL, so clearing the URL search bar serves no purpose.
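The create-once, reuse, then quit pattern can be sketched without launching a real browser. RecordingDriver below is a hypothetical stand-in that only mimics the get() and quit() methods of a Selenium WebDriver, and visit_all is an illustrative helper name, not part of Selenium:

```python
class RecordingDriver:
    """Hypothetical stand-in for a Selenium WebDriver; it only records calls."""

    def __init__(self):
        self.visited = []
        self.closed = False

    def get(self, url):
        self.visited.append(url)

    def quit(self):
        self.closed = True


def visit_all(driver, urls):
    """Load every URL in the same browser session, then shut the browser down."""
    try:
        for url in urls:
            driver.get(url)
            # ... locate elements and read results here ...
    finally:
        # Runs even if a page fails, so no browser window is left open.
        driver.quit()


driver = RecordingDriver()  # with Selenium this would be the webdriver.Chrome(...) object
visit_all(driver, ["yahoo.com", "facebook.com", "google.com"])
print(driver.visited, driver.closed)
```

The driver is created exactly once outside the loop, and the finally block closes it whether or not every page loads.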

Third Mistake:

df["Speed Results"] = scoreCard

won't add any rows to your data frame: assigning a single value to a column of an empty DataFrame creates the column but no rows. Try this:

df = pd.DataFrame()
for i in range(3):
    df["Speed Results"]=i
print(df)

Output will be just

Speed Results

To insert rows into a DataFrame, use the loc indexer (iloc can only overwrite rows that already exist, it cannot add new ones). I have used loc for the solution. loc needs a row label, so I initialized a variable i=0 before the for loop to keep count of the rows and incremented it by 1 at the end of each iteration. Try this:

df = pd.DataFrame()
df["Speed Results"]="" 
'''
you can specify columns in Dataframe declaration too like:
df = pd.DataFrame(index=None,columns=["Speed Results"])
'''
for i in range(3):
    df.loc[i]=i
print(df)

Output:

    Speed Results
0   0
1   1
2   2

Fourth mistake: Since the score you want to add to your data frame is text, read it from the element's text attribute:

scoreCard = browser.find_element_by_class_name('speed-report-card-score')
df.loc[i]= scoreCard.text

What you should have added:

Sometimes the browser takes time to load elements, and if Selenium searches for an element that hasn't loaded yet, it raises an error. Use WebDriverWait to make Selenium wait for the element to load.

I have added a while loop that retries until the score card is loaded.
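For intuition, an explicit wait is essentially a polling loop with a deadline. The sketch below is a simplified stand-in written with the standard library only, not Selenium's actual implementation; wait_until and score_card_ready are hypothetical names used for illustration:

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "is the score card on the page yet?": succeeds on the third check.
attempts = {"count": 0}


def score_card_ready():
    attempts["count"] += 1
    return "score card" if attempts["count"] >= 3 else None


found = wait_until(score_card_ready, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])
```

Since a bounded wait already retries until its deadline, an unbounded loop around it mainly hides the case where the element never appears; raising the timeout is usually the simpler option.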

Full code:

import pandas as pd
from selenium import webdriver
from time import sleep
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By

chrome_options = webdriver.ChromeOptions()

chrome_options.add_argument("start-maximized")

cpath="C:/Users/Downloads/chromedriver_win32/chromedriver.exe"


dataSheet = pd.read_excel("C:/Users/Downloads/URL_Test_File.xlsx")
df = pd.DataFrame(index=None,columns=["Speed Results"])
#df["Speed Results"]=""
browser = webdriver.Chrome(chrome_options=chrome_options,executable_path=cpath)

i=0

for armyURL in dataSheet['URLs']:
    # Reuse the browser created before the loop; only the page is reloaded for each URL
    browser.get(('https://developers.google.com/speed/pagespeed/insights/'))
    sleep(3)
    searchBar = browser.find_element_by_name('url')
    searchBar.send_keys(armyURL)
    searchBar.send_keys(Keys.RETURN)
    sleep(7)
    while(True):
        try:
            WebDriverWait(browser,10).until(EC.presence_of_element_located((By.CLASS_NAME,'speed-report-card-score')))
            break
        except:
            pass
    scoreCard = browser.find_element_by_class_name('speed-report-card-score')
    #scoreCard=browser.find_element_by_xpath('//div[@class="speed-report"]/div[@class="speed-report-card left"]/p[@class="speed-report-card-score"]/span[@class="fast"]')
    df.loc[i]= scoreCard.text
    clearBar = browser.find_element_by_name('url')
    clearBar.clear()
    i+=1

print(df)

OUTPUT:

      Speed Results
0  1.2s FCP2.2s DCL
1  1.7s FCP3.1s DCL
2  0.7s FCP0.7s DCL
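The text attribute returns both metrics run together in one string (for example "1.2s FCP2.2s DCL"). If separate numeric values are wanted, a regular expression can pull them apart. This is a minimal sketch that assumes the text always follows the "<number>s <LABEL>" shape seen in the output above; parse_score and METRIC_PATTERN are hypothetical names:

```python
import re

# One metric looks like "1.2s FCP": a number, the unit "s", then an uppercase label.
METRIC_PATTERN = re.compile(r"(\d+(?:\.\d+)?)s\s*([A-Z]+)")


def parse_score(text):
    """Return a dict such as {"FCP": 1.2, "DCL": 2.2} from the score card text."""
    return {label: float(seconds) for seconds, label in METRIC_PATTERN.findall(text)}


print(parse_score("1.2s FCP2.2s DCL"))  # {'FCP': 1.2, 'DCL': 2.2}
```

If the page changes its wording or units, the pattern would need to be adjusted; text that does not match simply yields an empty dict.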

Assuming your data is read from the Excel sheet and parsed correctly, this new code should do what you want. You need to either append data to your df or, as I have done here, use the pd.DataFrame.from_dict() function to create the data frame from a dictionary of your data:

from selenium import webdriver
import time
import pandas as pd

dataSheet = pd.read_excel("URL_Test_File.xlsx")
#df = pd.DataFrame()  # We will create the df at the end
pageSpeed = []
url_list = [] # Create a list to collect your URLs as you iterate

for armyURL in dataSheet['URLs']: # Iterate over the URL values, not the column names
    browser = webdriver.Chrome('C:\\Webdriver\\chromedriver')
    browser.get(('https://developers.google.com/speed/pagespeed/insights/'))
    time.sleep(3)
    searchBar = browser.find_element_by_name('url')
    searchBar.send_keys(armyURL)
    searchBar.send_keys(u'\ue007')
    time.sleep(7)
    scoreCard = browser.find_element_by_class_name('speed-report-card-score')
    pageSpeed.append(scoreCard.text) # Add the speed text (not the WebElement) to your pageSpeed[] list
    url_list.append(armyURL) # Add the URL data to your url_list[] list
    clearBar = browser.find_element_by_name('url')
    clearBar.clear()
    browser.quit() # Close the browser since we'll open a new one up the next time (and we should always have a .quit() at the end of our Selenium code)

speed_test_dict = {'Pages': url_list, 'Page Speed': pageSpeed}
df = pd.DataFrame.from_dict(speed_test_dict)

Since I don't have your Excel file, I can't fully test this, but it should work (I will edit it if you run into any issues).

Are you looking for something like this?

...
# add the right number of columns based on the number of elements in 
# scoreCard_list (see below)
result = pd.DataFrame(columns=["column a", "column b"]) 
counter = 0
for data in dataSheet["URLs"]:
  counter += 1
  ...
  scoreCard_list = scoreCard.text.split() # splits on whitespace; pass a delimiter to split on something else
  result.loc[counter] = scoreCard_list
  ...

Update :

I realized that there were more flaws in my initial code than expected, especially referencing the data frame inside a loop that uses that same data frame as its parameter. This is what I ultimately wrote to get the loop to work (thanks Leo and dblclik for looking over this).

from selenium import webdriver
import time
import pandas as pd

dataSheet = pd.read_excel("URL_Test_File.xlsx") #test file has a column labeled URLs
df = pd.DataFrame()
pageSpeed = []

browser = webdriver.Chrome('C:\\Webdriver\\chromedriver')
browser.get(('https://developers.google.com/speed/pagespeed/insights/'))
time.sleep(3)

for i in dataSheet["URLs"]: #Specifying the exact part of the data frame to iterate over
    enterURL = i
    searchBar = browser.find_element_by_name('url')
    searchBar.send_keys(enterURL)
    searchBar.send_keys(u'\ue007')
    time.sleep(7)
    scoreCard = browser.find_element_by_class_name('speed-report-card-score')
    df["Speed Results"] = scoreCard.text
    clearBar = browser.find_element_by_name('url')
    clearBar.clear()

There are still some issues with gathering accurate information and appending results that need to be troubleshot, but for anyone with the same problem iterating through a dataframe with Selenium, this should be a halfway decent start.
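One possible way around the appending issue is to collect one record per URL in a plain list and build the DataFrame once after the loop. This is only a sketch: the score strings are placeholders copied from the sample output earlier in this thread (in the real loop each would come from scoreCard.text), and the column name "URL" is an assumption:

```python
import pandas as pd

urls = ["yahoo.com", "facebook.com", "google.com"]
# Placeholder scores standing in for scoreCard.text on each iteration.
scores = ["1.2s FCP2.2s DCL", "1.7s FCP3.1s DCL", "0.7s FCP0.7s DCL"]

rows = []
for url, score in zip(urls, scores):
    # One dict per URL; every dict becomes one row of the final table.
    rows.append({"URL": url, "Speed Results": score})

# Build the DataFrame once, after all rows have been collected.
df = pd.DataFrame(rows, columns=["URL", "Speed Results"])
print(df)
```

Building the frame at the end keeps each URL next to its result and avoids tracking a row counter by hand.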
