
Webscraping - Selenium - Python

I want to extract all the fantasy teams that have been entered for past contests. To loop through the dates, I just change a small part of the URL, as shown in my code below:

#Packages:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
import pandas as pd


# Driver
chromedriver =("C:/Users/Michel/Desktop/python/package/chromedriver_win32/chromedriver.exe")
driver = webdriver.Chrome(chromedriver)

# DataFrames that will be used later
results = pd.DataFrame()
best_lineups=pd.DataFrame()
opti_lineups=pd.DataFrame()

#For loop over all DATES:

calendar=[]
calendar.append("2019-01-10")
calendar.append("2019-01-11")

for d in calendar:
    driver.get("https://rotogrinders.com/resultsdb/date/"+d+"/sport/4/")

Then, to access the different contests of that day, you need to click on the contest tab. I use the following code to locate and click on it:

 # Find "Contest" tab   
    contest= driver.find_element_by_xpath("//*[@id='root']/div/main/main/div[2]/div[3]/div/div/div[1]/div/div/div/div/div[3]")
    contest.click()

I simply inspected and copied the XPath of the tab. Most of the time this works, but sometimes I get the error message "Unable to locate element...". Moreover, it seems to work only for the first date in my calendar loop and always fails on the next iteration... I do not know why. I tried to locate it differently, but I feel I am missing something, such as:

contests = driver.find_element_by_xpath("//*[@role='tab']")

Once the contest tab is successfully clicked, all contests of that day are listed, and you can click a link to access all the entries of that contest. I stored the contest links in order to iterate through them all, as follows:

    list_links = driver.find_elements_by_tag_name('a')
    hlink=[]
    for ii in list_links:
        hlink.append(ii.get_attribute("href"))
    sub="https://rotogrinders.com/resultsdb"
    con= "contest"
    contest_list=[]
    for text in hlink:
        if sub in text:
            if con in text:
                contest_list.append(text)
# Iterate through all the entries (users) of a contest and extract the team entered by each user

    for c in contest_list:
        driver.get(c)

Then, I want to extract every participant's team entered in the contest and store it in a dataframe. I am able to do this successfully for the first page of the contest.

# Wait until tables are loaded and have text; times out after 60 seconds
        while WebDriverWait(driver, 60).until(ec.presence_of_element_located((By.XPATH, './/tbody//tr//td//span//a[text() != ""]'))):

# while ????: 

# Get tables to get the user names
            tables = pd.read_html(driver.page_source)
            users_df  = tables[0][['Rank','User']]
            users_df['User'] = users_df['User'].str.replace(' Member', '')

# Initialize results dataframe and iterate through users

            for i, row in users_df.iterrows():

                rank = row['Rank']
                user = row['User']

    # Find the user name and click on the name
                user_link = driver.find_elements(By.XPATH, "//a[text()='%s']" %(user))[0]
                user_link.click()

    # Get the lineup table after clicking on the user name
                tables = pd.read_html(driver.page_source)
                lineup = tables[1]

    #print (user)
    #print (lineup)

    # Restructure to put into results dataframe
                lineup.loc[9, 'Name'] = lineup.iloc[9]['Salary']
                lineup.loc[10, 'Name'] = lineup.iloc[9]['Pts']

                temp_df = pd.DataFrame(lineup['Name'].values.reshape(-1, 11), 
                columns=lineup['Pos'].iloc[:9].tolist() + ['Total_$', 'Total_Pts'] )

                temp_df.insert(loc=0, column = 'User', value = user)
                temp_df.insert(loc=0, column = 'Rank', value = rank)
                temp_df["Date"]=d
                results = results.append(temp_df)
            #next_button = driver.find_elements_by_xpath("//button[@type='button']")
            #next_button[2].click()



            results = results.reset_index(drop=True)



driver.close()

However, there are other pages, and to access them you need to click the small arrow (next button) at the bottom. Moreover, you can click that button indefinitely, even when there are no more entries. Therefore, I would like to loop through all pages that have entries, stop when there are no more, and move on to the next contest. I tried to implement a while loop to do so, but my code did not work...
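The stopping condition I am after can be framed as: keep reading pages until clicking "next" no longer changes the rows. A minimal sketch of that loop, with the Selenium calls abstracted behind two callables (`read_rows` and `click_next` are hypothetical names, not part of any library):

```python
# Sketch of the pagination stopping condition: keep paging until
# clicking "next" no longer changes the rows on screen.
# `read_rows` and `click_next` are hypothetical callables that would
# wrap the actual Selenium calls.

def collect_pages(read_rows, click_next, max_pages=200):
    """Accumulate rows page by page; stop when a page repeats."""
    all_rows = []
    prev = None
    for _ in range(max_pages):
        rows = read_rows()
        if rows == prev:          # clicking "next" changed nothing -> done
            break
        all_rows.extend(rows)
        prev = rows
        click_next()
    return all_rows
```

With Selenium, `read_rows` could return the list of user names parsed from `pd.read_html(driver.page_source)`, and `click_next` could click the arrow button; comparing consecutive pages is a cheap way to detect that the button has stopped doing anything.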

You must really make sure that the page loads completely before you do anything on that page.

Moreover, it seems to work only for the first date in my calendar loop and always fails in the next iteration

Usually when Selenium loads a browser page, it tries to look for the element even if the page is not loaded all the way. I suggest you recheck the XPath of the element you are trying to click.

Also try to see when the page loads completely and use time.sleep(number of seconds) to make sure you hit the element, or check for a particular element (or a property of an element) that tells you the page has been loaded.
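Checking for a particular element instead of sleeping a fixed time is essentially a poll loop, which is what `WebDriverWait(...).until(...)` does for you. A plain-Python sketch of the idea:

```python
import time

def wait_for(condition, timeout=30, poll=0.5):
    """Poll `condition` until it returns something truthy, or raise
    after `timeout` seconds -- roughly what WebDriverWait does."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within %s seconds" % timeout)
```

With Selenium this could be, for example, `wait_for(lambda: driver.find_elements(By.XPATH, "//table//a"))`, which is better than a blind `time.sleep(10)` because it returns as soon as the page is actually ready.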

One more suggestion: you can use driver.current_url to see which page you are targeting. I had this issue while working with multiple tabs, and I had to tell Python/Selenium to manually switch to that tab.
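Switching to the tab that actually holds the results page can be sketched like this; the `driver.window_handles`, `driver.switch_to.window` and `driver.current_url` calls are standard Selenium, but the helper itself is hypothetical:

```python
def switch_to_tab(driver, url_fragment):
    """Walk all open browser tabs and switch to the first one whose
    URL contains `url_fragment`; return True if one was found."""
    for handle in driver.window_handles:
        driver.switch_to.window(handle)
        if url_fragment in driver.current_url:
            return True
    return False
```

For example, `switch_to_tab(driver, "resultsdb")` before scraping, combined with printing `driver.current_url`, confirms which page Selenium is actually looking at.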
