
Web scraping with Selenium

I want to store in a data frame all the teams entered in the NHL $30K Finnish Flash contest on 2019-01-10. So far I am only able to store the teams from the first page. Moreover, if a user entered two different teams, his highest-ranking team is stored both times... Here is my code:

# Packages:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
import pandas as pd
import time

# Driver
chromedriver = "C:/Users/Michel/Desktop/python/package/chromedriver_win32/chromedriver.exe"
driver = webdriver.Chrome(chromedriver)

# DataFrame that will be used later
results = pd.DataFrame()




calendar = ["2019-01-10"]


for d in calendar:
    driver.get("https://rotogrinders.com/resultsdb/date/"+d+"/sport/4/")

    time.sleep(10)
    contest= driver.find_element_by_xpath("//*[@id='root']/div/main/main/div[2]/div[3]/div/div/div[1]/div/div/div/div/div[3]")



    contest.click()
    list_links = driver.find_elements_by_tag_name('a')
    hlink=[]
    for ii in list_links:
        hlink.append(ii.get_attribute("href"))
    sub="https://rotogrinders.com/resultsdb"
    con= "contest"
    contest_list=[]
    for text in hlink:
        if sub in text:
            if con in text:
                contest_list.append(text)

    c=contest_list[2]
    driver.get(c)


    WebDriverWait(driver, 60).until(ec.presence_of_element_located((By.XPATH, './/tbody//tr//td//span//a[text() != ""]')))


# Get tables to get the user names
    tables = pd.read_html(driver.page_source)
    users_df  = tables[0][['Rank','User']]
    users_df['User'] = users_df['User'].str.replace(' Member', '')

# Initialize results dataframe and iterate through users

    for i, row in users_df.iterrows():

        rank = row['Rank']
        user = row['User']

    # Find the user name and click on the name
        user_link = driver.find_elements(By.XPATH, "//a[text()='%s']" %(user))[0]
        user_link.click()

    # Get the lineup table after clicking on the user name
        tables = pd.read_html(driver.page_source)
        lineup = tables[1]

    # Restructure to put into results dataframe
        lineup.loc[9, 'Name'] = lineup.iloc[9]['Salary']
        lineup.loc[10, 'Name'] = lineup.iloc[9]['Pts']

        temp_df = pd.DataFrame(lineup['Name'].values.reshape(-1, 11),
                               columns=lineup['Pos'].iloc[:9].tolist() + ['Total_$', 'Total_Pts'])

        temp_df.insert(loc=0, column = 'User', value = user)
        temp_df.insert(loc=0, column = 'Rank', value = rank)
        temp_df["Date"]=d
        results = results.append(temp_df)        

    results = results.reset_index(drop=True)

driver.close()

So, I would like:

1) To iterate through all pages.

I did locate the next-page button, with:

next_button = driver.find_elements_by_xpath("//button[@type='button']")

But I am not able to add that step to my for loop.
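
A minimal sketch of how that could be folded in (assuming, which should be checked against the real page, that the last button of that type is the next-page control and that it is disabled on the final page):

while True:
    # ... scrape the current page here (the users_df / lineup code above) ...

    buttons = driver.find_elements_by_xpath("//button[@type='button']")
    if not buttons:
        break  # no pagination controls on this page

    next_button = buttons[-1]  # assumption: the last one is "next"
    if next_button.get_attribute("disabled"):
        break  # assumption: the control is disabled on the last page

    next_button.click()
    time.sleep(5)  # or, better, an explicit WebDriverWait for the new rows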

2) To access the different user_link elements if a user entered the contest more than once. I think maybe I could do it with a for loop using the frequency of each user, like this:

user_counts = users_df.groupby("User")["User"].count()

for i in range(user_counts[user]):
    user_link = driver.find_elements(By.XPATH, "//a[text()='%s']" % user)[i]
    user_link.click()

But I always get error messages when adding those steps, or, when it does run, it simply skips the part that stores the teams row by row and quickly closes the driver...
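
For reference, a sketch that keeps a per-user counter so that the n-th occurrence of a name clicks the n-th matching link instead of always the first (it assumes, like the code above, that each entry's link text is exactly the user name):

from collections import defaultdict

seen = defaultdict(int)  # how many entries of each user were already opened

for i, row in users_df.iterrows():
    rank = row['Rank']
    user = row['User']

    # Index by the number of this user's entries already visited,
    # instead of always taking link [0]
    user_links = driver.find_elements(By.XPATH, "//a[text()='%s']" % user)
    user_link = user_links[seen[user]]
    seen[user] += 1
    user_link.click()

    # ... read the lineup tables exactly as in the code above ...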

My suggestions:

It will be enough to use just requests (or any other equivalent module) to get the data from the server, because the service you want to scrape has an API server; for example, check the link. The example uses the first endpoint:
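
The example itself did not survive in this copy; below is only a minimal sketch of the idea. The endpoint URL and its parameters are assumptions and should be confirmed by watching the XHR requests in the browser's developer tools while the results page loads:

import requests

# Hypothetical endpoint and parameters -- verify the real ones in the
# browser's network tab (XHR requests) before relying on them
url = "https://resultsdb-api.rotogrinders.com/api/contests"
params = {"start": "2019-01-10", "end": "2019-01-10", "sport": 4}

resp = requests.get(url, params=params)
resp.raise_for_status()
contests = resp.json()  # JSON is far easier to work with than scraped HTML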

Hope this makes your task easier.


 