
How to get data from a JavaScript-rendered table using Selenium in Python

I have a website to scrape and I am using Selenium to do it. When I finished writing the code, I noticed that I was not getting any output at all when I printed the table contents. I viewed the page source and found that the table was not in the source. That is why, even when I take the XPath of the table from the inspect element, I can't get any output from it. Does someone know how I could get the response/data, or just print the table from the JavaScript response? Thanks.

Here is my current code:

from bs4 import BeautifulSoup
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--incognito')
chrome_path = r"C:\chromedriver.exe"
driver = webdriver.Chrome(chrome_path,options=options)

driver.implicitly_wait(3)
url = "https://reversewhois.domaintools.com/?refine#q=%5B%5B%5B%22whois%22%2C%222%22%2C%22VerifiedID%40SG-Mandatory%22%5D%5D%5D"
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html,'lxml')

# These lines select the desired search parameter from the combo box;
# you can disregard them since I put the whole URL with params above
input = driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[3]/input')
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[1]/div').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[5]/div[1]/div/div[3]').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[2]/div/div[1]').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[6]/div[1]/div/div[1]').click()
input.send_keys("VerifiedID@SG-Mandatory")
driver.find_element_by_xpath('//*[@id="search-button-container"]/button').click()


table = driver.find_elements_by_xpath('//*[@id="refine-preview-content"]/table/tbody/tr/td')
for i in table:
    print(i)  # no output

I just want to scrape all the domain names, as in the first result, e.g. 0 _ _ .sg.

You can try the code below. After all the detail options have been filled in and the search button clicked, a fixed `time.sleep` gives the page time to finish rendering, to make sure we get the full page source. Then we use `read_html` from pandas, which searches for any tables present in the HTML and returns a list of DataFrames; we take the required DataFrame from there.

from selenium import webdriver
import time
from selenium.webdriver.chrome.options import Options
import pandas as pd

options = Options()
options.add_argument('--incognito')
chrome_path = r"C:/Users/prakh/Documents/PythonScripts/chromedriver.exe"
driver = webdriver.Chrome(chrome_path,options=options)

driver.implicitly_wait(3)
url = "https://reversewhois.domaintools.com/?refine#q=%5B%5B%5B%22whois%22%2C%222%22%2C%22VerifiedID%40SG-Mandatory%22%5D%5D%5D"
driver.get(url)
#html = driver.page_source
#soup = BeautifulSoup(html,'lxml')

# These lines select the desired search parameter from the combo box
input = driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[3]/input')
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[1]/div').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[5]/div[1]/div/div[3]').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[2]/div/div[1]').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[6]/div[1]/div/div[1]').click()
input.send_keys("VerifiedID@SG-Mandatory")
driver.find_element_by_xpath('//*[@id="search-button-container"]/button').click()

time.sleep(5)
html = driver.page_source
tables = pd.read_html(html)

df = tables[-1]
print(df)
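To see what `pd.read_html` actually hands back, here is a minimal, self-contained sketch with an inline HTML snippet standing in for the page source; the column names are made up for illustration and the real result table on the site may differ:

```python
import pandas as pd
from io import StringIO

# pandas.read_html scans raw HTML for <table> elements and returns a list
# of DataFrames, one per table found.
html = """
<table>
  <tr><th>Domain</th><th>Created</th><th>Registrar</th></tr>
  <tr><td>example.sg</td><td>2019-01-01</td><td>Example Registrar</td></tr>
  <tr><td>sample.sg</td><td>2020-05-20</td><td>Sample Registrar</td></tr>
</table>
"""
tables = pd.read_html(StringIO(html))
df = tables[-1]  # same indexing as above: take the last table found
print(df["Domain"].tolist())  # ['example.sg', 'sample.sg']
```

Note that `read_html` treats the `<th>` row as the header, which is why the columns can be addressed by name.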

If you are open to other approaches, does the following give the expected results? It mimics the XHR request the page makes (though I have trimmed it down to the essential elements only) to retrieve the lookup results. This is faster than using a browser.

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://reversewhois.domaintools.com/?ajax=mReverseWhois&call=ajaxUpdateRefinePreview&q=[[[%22whois%22,%222%22,%22VerifiedID@SG-Mandatory%22]]]&sf=true', headers=headers)
table = pd.read_html(r.json()['results'])
print(table)
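The JSON returned by that endpoint carries the result table as an HTML fragment under the `results` key, which is why `read_html` can parse it directly. Since the original question only asks for the domain names, here is a sketch of pulling just the first column out by position, so the exact header text does not matter; the fragment below is a hypothetical stand-in for the live response, whose markup may differ:

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-in for r.json()['results'] from the request above.
results_html = (
    "<table>"
    "<tr><th>Domain Name</th><th>Registrant</th></tr>"
    "<tr><td>0abc.sg</td><td>Some Company</td></tr>"
    "<tr><td>0xyz.sg</td><td>Another Company</td></tr>"
    "</table>"
)
df = pd.read_html(StringIO(results_html))[0]
# Take the first column by position rather than by header name.
domains = df.iloc[:, 0].tolist()
print(domains)  # ['0abc.sg', '0xyz.sg']
```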
