Using Selenium to scrape a table across multiple pages when the URL does not change

Question

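I have been trying to write a program that scrapes statistics from www.whoscored.com and builds a pandas DataFrame.

I have updated the code with crookedleaf's help, and this is the working code: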

import time
import pandas as pd
from pandas.io.html import read_html
from pandas import DataFrame
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6335/Stages/13796/PlayerStatistics/England-Premier-League-2016-2017')

summary_stats = DataFrame()

while True:

    # Wait until the table has finished updating before reading it
    while driver.find_element_by_xpath('//*[@id="statistics-table-summary"]').get_attribute('class') == 'is-updating':
        time.sleep(1)

    table = driver.find_element_by_xpath('//*[@id="statistics-table-summary"]')
    table_html = table.get_attribute('innerHTML')
    page_number = driver.find_element_by_xpath('//*[@id="currentPage"]').get_attribute('value')
    print('Page ' + page_number)
    df1 = read_html(table_html)[0]
    summary_stats = pd.concat([summary_stats, df1])
    next_link = driver.find_element_by_xpath('//*[@id="next"]')

    if 'disabled' in next_link.get_attribute('class'):
        break

    next_link.click()

print(summary_stats)

driver.close()
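
The inner `while` loop above polls once per second with no upper bound, so it would spin forever if the table never left the `is-updating` state. A minimal sketch of a bounded version follows. `wait_until` and its `timeout` and `interval` parameters are hypothetical names introduced here for illustration, not part of Selenium (Selenium ships its own `WebDriverWait` for the same purpose).

```python
import time


def wait_until(condition, timeout=30.0, interval=1.0):
    """Poll `condition()` until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage with the driver from the script above:
# table_ready = wait_until(
#     lambda: driver.find_element_by_xpath('//*[@id="statistics-table-summary"]')
#                   .get_attribute('class') != 'is-updating'
# )
```

Returning a boolean instead of raising lets the caller decide whether a timeout should abort the scrape or just skip the page.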

Now I am trying to collect statistics from the other tabs. I am really close, but the code does not exit the loop when it should break. Here is the code:

defensive_button = driver.find_element_by_xpath('//*[@id="stage-top-player-stats-options"]/li[2]/a')
defensive_button.click()

defensive_stats = DataFrame()

while True:

    # Wait until the table has finished updating before reading it
    while driver.find_element_by_xpath('//*[@id="statistics-table-defensive"]').get_attribute('class') == 'is-updating':
        time.sleep(1)

    table = driver.find_element_by_xpath('//*[@id="statistics-table-defensive"]')
    table_html = table.get_attribute('innerHTML')
    page_number = driver.find_element_by_xpath('//*[@id="statistics-paging-defensive"]/div/input[1]').get_attribute('value')
    print('Page ' + page_number)
    df2 = read_html(table_html)[0]
    defensive_stats = pd.concat([defensive_stats, df2])
    next_link = driver.find_element_by_xpath('//*[@id="statistics-paging-defensive"]/div/dl[2]/dd[3]')

    if 'disabled' in next_link.get_attribute('class'):
        break

    next_link.click()

print(defensive_stats)

This code goes through all the pages, but then keeps looping over the last page indefinitely.

Answer 1

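You are defining the table outside of the loop. You navigate to the next page, but you never redefine the table and table_html elements. Move them to the first lines after while True.

Edit: After the changes you made to your code, my guess is that, because the table's content is loaded dynamically, the "loading" graphic overlay keeps you from handling the change or from getting the content. Another thing is that there will not always be 30 pages. For example, today there are 29, so the code kept fetching data from page 29. I modified the code so that it keeps running until the "next" button is no longer enabled, and so that it waits while the table is loading before continuing: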

import time
from pandas.io.html import read_html
from pandas import DataFrame
from selenium import webdriver

driver = webdriver.Chrome('path-to-your-chromedriver')  # placeholder: replace with the path to your chromedriver binary
driver.get('https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6335/Stages/13796/PlayerStatistics/England-Premier-League-2016-2017')

df = DataFrame()

while True:

    # Wait until the table has finished updating before reading it
    while driver.find_element_by_xpath('//*[@id="statistics-table-summary"]').get_attribute('class') == 'is-updating':
        time.sleep(1)

    table = driver.find_element_by_xpath('//*[@id="statistics-table-summary"]')
    table_html = table.get_attribute('innerHTML')
    page_number = driver.find_element_by_xpath('//*[@id="currentPage"]').get_attribute('value')
    print('Page ' + page_number)
    df1 = read_html(table_html)[0]
    df.append(df1)
    next_link = driver.find_element_by_xpath('//*[@id="next"]')

    if 'disabled' in next_link.get_attribute('class'):
        break

    next_link.click()


print(df)

driver.close()

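However, I end up with an empty DataFrame when I run this. Unfortunately, I am not familiar enough with pandas to pin down the problem, but it has something to do with df.append(). I ran it printing the value of df1 on each pass through the loop, and it prints the correct data, but that data is not added to the DataFrame. You may be familiar enough with pandas to make the change needed to get it running fully.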
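
A likely explanation for the empty result, offered as a sketch rather than a verified fix for this script: `DataFrame.append` never modified the frame in place. It returned a new DataFrame, so calling `df.append(df1)` without assigning the result leaves `df` empty. (The method was later deprecated and then removed in pandas 2.0.) Collecting the pages in a list and calling `pd.concat` once sidesteps this. The sample frames below are made-up stand-ins for the scraped pages.

```python
import pandas as pd

# Made-up stand-ins for the per-page tables returned by read_html
page1 = pd.DataFrame({'Player': ['A', 'B'], 'Rating': [7.5, 7.1]})
page2 = pd.DataFrame({'Player': ['C'], 'Rating': [6.9]})

pages = []                      # collect one DataFrame per page
for page_df in (page1, page2):
    pages.append(page_df)       # list.append does mutate the list

# pd.concat returns a new DataFrame, so the result must be assigned
df = pd.concat(pages, ignore_index=True)
print(df)
```

The same rule applies when concatenating inside the loop: the result has to be assigned back, as in `summary_stats = pd.concat([summary_stats, df1])`.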

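Edit 2: This took me a while to figure out. Essentially, the page's content is loaded dynamically with JavaScript. The "next" element you declared is still the first "next" button found on the page. Each time you click a new tab, the number of "next" elements increases. I added an edit that successfully navigates through all the tabs (except the "Detailed" tab... hopefully you don't need that one, lol). I am, however, still getting empty DataFrame()s: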

import time
import pandas as pd
from pandas.io.html import read_html
from pandas import DataFrame
from selenium import webdriver

driver = webdriver.Chrome('/home/mdrouin/Downloads/chromedriver')
driver.get('https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6335/Stages/13796/PlayerStatistics/England-Premier-League-2016-2017')

statistics = {  # one (initially empty) DataFrame per tab on the page
    'summary': DataFrame(),
    'defensive': DataFrame(),
    'offensive': DataFrame(),
    'passing': DataFrame()
}

count = 0
tabs = driver.find_element_by_xpath('//*[@id="stage-top-player-stats-options"]').find_elements_by_tag_name('li')  # this pulls all the tab elements
for tab in tabs[:-1]:  # iterate over the different tab sections
    section = tab.text.lower()
    driver.find_element_by_xpath('//*[@id="stage-top-player-stats-options"]').find_element_by_link_text(section.title()).click()  # clicks the actual tab via its link text (.title() capitalizes the first letter of each word)
    time.sleep(3)
    while True:
        while driver.find_element_by_xpath('//*[@id="statistics-table-%s"]' % section).get_attribute('class') == 'is-updating':  # string formatting on the xpath to change for each section that is iterated over
            time.sleep(1)

        table = driver.find_element_by_xpath('//*[@id="statistics-table-%s"]' % section)  # string formatting on the xpath to change for each section that is iterated over
        table_html = table.get_attribute('innerHTML')
        df = read_html(table_html)[0]
        # print(df)
        pd.concat([statistics[section], df])
        next_link = driver.find_elements_by_xpath('//*[@id="next"]')[count]  # makes sure it's selecting the correct index of 'next' items 
        if 'disabled' in next_link.get_attribute('class'):
            break
        time.sleep(5)
        next_link.click()
    count += 1


for df in statistics.values():  # iterate over the collected DataFrames
    print(df)

driver.quit()
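
To illustrate why the first `//*[@id="next"]` match stops being the right one once a second tab has been opened, here is a small sketch that uses only the standard library. The markup is invented for illustration: the `statistics-paging-summary` id and the class names are assumptions modeled on the ids in the code above, not copied from the site. It shows that an unscoped lookup returns the summary tab's button, whereas a lookup scoped to the current tab's paging container returns the intended one.

```python
import xml.etree.ElementTree as ET

# Invented markup: each opened tab contributes its own paging block,
# and every block reuses id="next" for its button.
page = ET.fromstring("""
<div>
  <div id="statistics-paging-summary">
    <a id="next" class="option clickable">next</a>
  </div>
  <div id="statistics-paging-defensive">
    <a id="next" class="option disabled">next</a>
  </div>
</div>
""")


def next_button(root, section):
    """Return the 'next' link inside the paging block of `section`."""
    container = root.find(".//*[@id='statistics-paging-%s']" % section)
    return container.find(".//*[@id='next']")


first_match = page.find(".//*[@id='next']")   # unscoped: first in document order
scoped = next_button(page, 'defensive')       # scoped to the defensive tab

print(first_match.get('class'))
print(scoped.get('class'))
```

With Selenium, the same idea would be an XPath such as `//*[@id="statistics-paging-defensive"]//*[@id="next"]`, assuming the live page follows this structure. That would avoid tracking the button's position with a separate `count` variable.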

Answer 1: 0, accepted, 2017-03-06 23:26:47
