How can I parse the table content from the website using Selenium?
I'm trying to parse the tables on a sports website into a list of dictionaries to render into a template. This is my first exposure to Selenium; I tried to read the Selenium documentation and wrote this program:
from bs4 import BeautifulSoup
import time
from selenium import webdriver
url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()
browser.get(url)
time.sleep(3)
html = browser.page_source
soup = BeautifulSoup(html, "lxml")
print(len(soup.find_all("table")))
print(soup.find("table", {"class": "ratingstable"}))
browser.close()
browser.quit()
I'm getting 0 and None as the values. How can I modify the program to get all the values of the table and store them in a list of dictionaries? If you have any other questions, feel free to ask.
First of all, avoid using time.sleep(). It is against all best practices. Use an Explicit Wait instead.
If you inspect the table, you can see that it is located inside an <iframe> tag with name="testbat". So, you'll have to switch to that frame in order to get the contents of the table. It can be done like this:
browser.switch_to.default_content()
browser.switch_to.frame('testbat')
After switching to the frame, use the Explicit Wait as mentioned above.
Complete code:
from bs4 import BeautifulSoup
from selenium import webdriver
# Add the following imports to your program
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()
browser.get(url)
browser.switch_to.default_content()
browser.switch_to.frame('testbat')
try:
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'ratingstable')))
except TimeoutException:
pass # Handle the time out exception
html = browser.find_element(By.CLASS_NAME, 'ratingstable').get_attribute('innerHTML')  # find_element_by_class_name was removed in Selenium 4
soup = BeautifulSoup(html, "lxml")
You can check whether you've got the table:
>>> print('S.P.D. Smith' in html)
True
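Once you have the table's HTML in soup, turning it into the list of dictionaries you asked for is a plain BeautifulSoup exercise: take the header cells as keys and zip each data row against them. A minimal sketch, using a made-up sample table since the real column names and markup of the ESPN page may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's innerHTML; the live table's
# columns and class names are assumptions here.
html = """
<table class="ratingstable">
  <tr><th>Rank</th><th>Player</th><th>Rating</th></tr>
  <tr><td>1</td><td>S.P.D. Smith</td><td>947</td></tr>
  <tr><td>2</td><td>V. Kohli</td><td>922</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "ratingstable"})

# Header cells become the dictionary keys
headers = [th.get_text(strip=True) for th in table.find_all("th")]

# Each remaining row becomes one dictionary, ready to pass to a template
rows = []
for tr in table.find_all("tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append(dict(zip(headers, cells)))

print(rows)
```

This prints a structure like [{'Rank': '1', 'Player': 'S.P.D. Smith', 'Rating': '947'}, ...], which you can render directly in your template.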