Using Pandas and Beautiful Soup to collect data from a table after logging in with Selenium

Question

I'm trying to scrape data from a paginated table. The table can only be accessed by logging in to a user account, so I've decided to use Selenium to log in. I then hope to read the table into a Pandas DataFrame, with BeautifulSoup as a go-between.

Here is my code:

from selenium import webdriver
import time
import pandas as pd

from bs4 import BeautifulSoup

url = "https://www.example.com/userarea"

driver = webdriver.Chrome()
time.sleep(6)
driver.get(url)
time.sleep(6)

username = driver.find_element_by_id("user")
username.clear()
username.send_keys("xyz@email.com")

password = driver.find_element_by_id("password")
password.clear()
password.send_keys('password')

driver.find_element_by_xpath('//button[]').click()
driver.find_element_by_xpath('//button[text()="Log in"]').click()
time.sleep(6)

driver.find_element_by_xpath('//span[text()="Text"]').click()

driver.find_element_by_xpath('//span[text()="Text"]').click()

html = driver.page_source
soup = BeautifulSoup(html,'html.parser') 

try:
    tables = soup.find_all('th')
    print(tables) #Returns an empty list
    df = pd.read_html(str(tables))

    df.head()

except:
    driver.close()
driver.close()

Unfortunately, this only prints an empty list. I've tried using lxml as the parser too, with the same result. Going by the browser's inspection tools, there don't seem to be any table tags, so I tried to find all <th> tags instead (which definitely are present). Again, nothing. I've not yet tried to work through the individual pages; I only mention the pagination in case it offers a clue to the issue.

Any idea what I'm missing?

Answer 1

Thank you to those who offered suggestions. In the end furas's suggestion was the right one: the script was running too quickly. I paused Python for 6 seconds after clicking through to the page that contains the table. The page seems to be rendered by JavaScript, and I can now actually see the values pop into place as the script works through the pagination.

import time

#Navigate to page, then let it load using:

time.sleep(6)
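
A fixed 6-second pause works, but it waits the full time even when the table appears sooner and can still be too short on a slow connection. As a minimal sketch, not part of the original answer, the pause could be replaced by polling until the <th> tags actually show up in the page source. The names `wait_until` and `header_cells_present` are hypothetical helpers written for this illustration. Selenium also ships its own explicit-wait mechanism (`WebDriverWait` together with `expected_conditions`), which would be the more usual choice when a driver is available.

```python
import re
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


def header_cells_present(get_html):
    """Build a condition that returns the HTML once it contains a <th> tag.

    `get_html` is any zero-argument callable returning the current page
    source as a string (hypothetical; see the usage comment below).
    """
    def condition():
        html = get_html()
        # Match "<th>" or "<th ...>" but not "<thead>".
        return html if re.search(r"<th[\s>]", html) else None
    return condition


# Hypothetical usage with the driver from the question:
#     html = wait_until(header_cells_present(lambda: driver.page_source), timeout=15)
#     soup = BeautifulSoup(html, "html.parser")
```

Polling returns as soon as the header cells exist, and the `TimeoutError` makes it obvious when they never arrive, instead of silently producing an empty list.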

Answer 1 (accepted) 2021-04-28 16:39:38
