BS4 and scraping a dynamically updated table

Question

I am trying to scrape the links to all EPL players from whoscored.com (the link is in the variable root below). This is the code:

from bs4 import BeautifulSoup
from selenium import webdriver
root = "https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6335/Stages/13796/PlayerStatistics/England-Premier-League-2016-2017"
driver = webdriver.PhantomJS()
driver.get(root)
page = driver.page_source
soup = BeautifulSoup(page, "html.parser")
players = soup.find("div", {'id':'statistics-table-summary'})

print(players)

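If you go to that page, you will see a list of players and a Next button that shows the next 10 players (284 players across 29 pages). The output I want: save the link to each of the ten players' profiles, then go to the next page and do the same for the next ten players, until finished.

To do this, I thought I would use soup.find_all('a', {'class': 'player-link'}), because each player's link and name sit in such a container, but this returns nothing. So I thought I would first find the table that holds them all, but that returned nothing either. Any ideas on this? Thanks in advance.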

Answer 1

You need to wait for the table to be loaded before getting .page_source:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# ...

driver.get(root)

# wait for at least one player to be present in the statistics table
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#statistics-table-summary .player-link")))

page = driver.page_source
driver.close()

# ...
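Once the wait has succeeded, the links can be pulled out of the page source and the Next button clicked to move on. The following is only a sketch under stated assumptions, not a tested scraper for the live site: the function names extract_player_links and scrape_all_pages are made up for illustration, the href values are assumed to be site-relative, and the selector for the Next control ("#statistics-paging-summary #next") is a guess that should be checked in the browser's inspector.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.whoscored.com"


def extract_player_links(page_source, base_url=BASE_URL):
    """Return absolute URLs of the player links inside the summary table."""
    soup = BeautifulSoup(page_source, "html.parser")
    anchors = soup.select("#statistics-table-summary a.player-link")
    # ASSUMPTION: hrefs are site-relative, so they are joined with the base URL
    return [urljoin(base_url, a["href"]) for a in anchors if a.get("href")]


def scrape_all_pages(driver, max_pages=29, timeout=10):
    """Collect player links page by page (hypothetical pagination sketch)."""
    # Imported here so extract_player_links works without Selenium installed
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    link_selector = "#statistics-table-summary .player-link"
    # ASSUMPTION: selector of the Next button; verify it against the real page
    next_selector = "#statistics-paging-summary #next"

    links = []
    for _ in range(max_pages):
        # Wait until at least one player link is present on the current page
        first = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, link_selector))
        )
        for url in extract_player_links(driver.page_source):
            if url not in links:
                links.append(url)

        next_buttons = driver.find_elements(By.CSS_SELECTOR, next_selector)
        if not next_buttons:
            break
        next_buttons[0].click()
        try:
            # ASSUMPTION: the table rows are replaced when the page changes,
            # so the old first link becomes stale
            wait.until(EC.staleness_of(first))
        except TimeoutException:
            # Nothing changed after the click: treat this as the last page
            break
    return links
```

With a driver that has already loaded root, calling scrape_all_pages(driver) before closing the driver would return the collected list. The default max_pages=29 simply mirrors the page count mentioned in the question.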

Answer 1: 2, accepted, 2017-01-04 18:02:04
