
How can I parse the table content from the website using Selenium?

I'm trying to parse the tables on a sports website into a list of dictionaries to render in a template. This is my first exposure to Selenium; I read the Selenium documentation and wrote this program:

from bs4 import BeautifulSoup
import time
from selenium import webdriver

url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()

browser.get(url)
time.sleep(3)
html = browser.page_source
soup = BeautifulSoup(html, "lxml")

print(len(soup.find_all("table")))
print(soup.find("table", {"class": "ratingstable"}))

browser.close()
browser.quit()

I'm getting `0` and `None` as the output. How can I modify the code to get all the values of the table and store them in a list of dictionaries? If you have any other questions, feel free to ask.

First of all, avoid using time.sleep(). It is against all best practices. Use an Explicit Wait instead.

If you inspect the table, you can see that it is located inside an <iframe> tag with name="testbat". So, you'll have to switch to that frame in order to get the contents of the table. It can be done like this:

browser.switch_to.default_content()
browser.switch_to.frame('testbat')

After switching to the frame, use the Explicit Wait mentioned above.

Complete code:

from bs4 import BeautifulSoup
from selenium import webdriver

# Add the following imports to your program
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()
browser.get(url)

browser.switch_to.default_content()
browser.switch_to.frame('testbat')

try:
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'ratingstable')))
except TimeoutException:
    pass  # Handle the time out exception

# find_element_by_* was removed in Selenium 4; use find_element with a By locator
html = browser.find_element(By.CLASS_NAME, 'ratingstable').get_attribute('innerHTML')
soup = BeautifulSoup(html, "lxml")

You can check whether you've got the table:

>>> print('S.P.D. Smith' in html)
True
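
Once `html` holds the table's markup, BeautifulSoup can turn it into the list of dictionaries the question asks for. A minimal sketch, using illustrative sample rows rather than the live page (the real column names come from whatever the table's first row actually contains):

```python
from bs4 import BeautifulSoup

# Illustrative sample data, not the live page; in the real script,
# `html` is the innerHTML fetched via Selenium above.
html = """
<table class="ratingstable">
  <tr><th>Rank</th><th>Player</th><th>Rating</th></tr>
  <tr><td>1</td><td>S.P.D. Smith</td><td>911</td></tr>
  <tr><td>2</td><td>V. Kohli</td><td>890</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")  # "lxml" works too if installed
rows = soup.find("table", {"class": "ratingstable"}).find_all("tr")

# First row holds the column names; the remaining rows hold the data.
headers = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
records = [
    dict(zip(headers, (cell.get_text(strip=True) for cell in row.find_all("td"))))
    for row in rows[1:]
]
print(records)
```

Each row becomes one dictionary keyed by the header text, e.g. `{'Rank': '1', 'Player': 'S.P.D. Smith', 'Rating': '911'}`, which can be passed straight to a template context.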
