Using Selenium in Python to get data from a dynamic website: how to discover the way database queries are made?

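I have previous coding experience, but not specifically with web applications. I have been tasked with getting data from this website: http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-de-derivativos/precos-referenciais/taxas-referenciais-bm-fbovespa/

The data are made available daily. I have been using Selenium in Python, and so far it has worked well: I can get the whole table, store it in a pandas DataFrame and then in a MySQL database. The problem is: the results from the website are always the same!

Here is my code: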

from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
import time

def GetDataFromWeb(day, month, year):
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    options.add_argument('window-size=1920x1080')
    #had to use these two below because of webdriver crashing issues
    options.add_argument('no-sandbox')
    options.add_argument('disable-dev-shm-usage')

    driver = webdriver.Chrome(chrome_options=options)

    driver.get("http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-de-derivativos/precos-referenciais/taxas-referenciais-bm-fbovespa/")

    #the table is on an iframe
    iframe = driver.find_element_by_id("bvmf_iframe")
    driver.switch_to.default_content()
    driver.switch_to.frame(iframe)

    #getting to the place where I should input the data
    date = driver.find_element_by_id("Data")
    date.send_keys("/".join((str(day),str(month),str(year))))
    date = driver.find_element_by_tag_name("button").click()

    #I have put this wait just to be sure it doesn't try to get info from an unloaded page
    time.sleep(5)

    page = bs(driver.page_source,"html.parser")

    table = page.find(id='tb_principal1')

    headers = ['Dias Corridos', '252','360']

    matrix = []
    for rows in table.select('tr')[2:]:
        values = []
        for columns in rows.select('td'):
            values.append(columns.text.replace(',','.'))
        matrix.append(values)

    df = pd.DataFrame(data=matrix, columns=headers)

    driver.close()

    #only the first 2 columns are interesting for my purposes
    return df.iloc[:,0:2]

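No matter what input I send, the table this function produces is always the same, and it appears to correspond to the date 06/09/2018 (month = 09, day = 06). I think the main problem is that I do not know how the site queries its database, so it always behaves as if a "default date" had been entered. I have read some people talking about Ajax and JavaScript requests, but I do not know whether that is the case here. How can I find out?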

This code will work (a few lines in the code were updated):

from selenium import webdriver
from bs4 import BeautifulSoup as bs
import time
import pandas as pd

def GetDataFromWeb(day, month, year):

    #to avoid data error in date handler: day and month must be two digits
    if month < 10:
        month="0"+str(month)
    if day < 10:
        day="0"+str(day)

    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    options.add_argument('window-size=1920x1080')
    #had to use these two below because of webdriver crashing issues
    options.add_argument('no-sandbox')
    options.add_argument('disable-dev-shm-usage')

    driver = webdriver.Chrome(chrome_options=options)

    driver.get("http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-de-derivativos/precos-referenciais/taxas-referenciais-bm-fbovespa/")

    #the table is on an iframe
    iframe = driver.find_element_by_id("bvmf_iframe")
    driver.switch_to.default_content()
    driver.switch_to.frame(iframe)

    #getting to the place where I should input the data
    date = driver.find_element_by_id("Data")
    date.clear()  #to clear auto populated data
    date.send_keys(str(day), str(month), str(year))  #removed the join part: digits only, no "/"

    driver.find_element_by_tag_name("button").click()

    #I have put this wait just to be sure it doesn't try to get info from an unloaded page
    time.sleep(50)

    page = bs(driver.page_source,"html.parser")

    table = page.find(id='tb_principal1')

    headers = ['Dias Corridos', '252','360']

    matrix = []
    for rows in table.select('tr')[2:]:
        values = []
        for columns in rows.select('td'):
            values.append(columns.text.replace(',','.'))
        matrix.append(values)

    df = pd.DataFrame(data=matrix, columns=headers)

    driver.close()

    #only the first 2 columns are interesting for my purposes
    return df.iloc[:,0:2]

print(GetDataFromWeb(3,9,2018))

It prints the data matching the requested date.
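The parsing loop in the code above swaps the decimal comma for a dot but leaves every cell as a string. If numeric values are needed downstream, a conversion step along the following lines might help. This is only a sketch: `parse_br_number` is a hypothetical helper, and it assumes the cells use Brazilian number formatting (dot as thousands separator, comma as decimal separator), which the page may or may not use for every column.

```python
def parse_br_number(text):
    """Convert a Brazilian-formatted number string to a float.

    Assumption: "." is a thousands separator and "," is the decimal
    separator, e.g. "1.234,56" means 1234.56.
    """
    # Drop thousands separators first, then turn the decimal comma into a dot.
    cleaned = text.strip().replace(".", "").replace(",", ".")
    return float(cleaned)


if __name__ == "__main__":
    for cell in ["6,50", "1.234,56", " 252 "]:
        print(cell, "->", parse_br_number(cell))
```

A plain `replace(',', '.')` would turn "1.234,56" into "1.234.56", which `float` rejects, so removing the thousands separator first is the safer order if such values can occur.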

I added the following to avoid a data error in the date handler:

if month < 10:
    month="0"+str(month)
if day < 10:
    day="0"+str(day)

date.clear()  #clears the auto-populated data
date.send_keys(str(day), str(month), str(year))  #removed the join part

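Note that the problem in your code was that the day and month fields take two digits, and the line date.send_keys("/".join((str(day), str(month), str(year)))) was causing an error: the system date was picked instead, so you always saw the same data whatever the input. Also, when you click on the date field it selects the default date, so we first have to clear it and then send the custom date. Hope this helps.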
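The zero-padding described above can also be done in one place with a small formatting helper. The sketch below is an illustration, not part of the original answer: `format_b3_date` is a hypothetical name, and it assumes the date field accepts the eight digits without separators, as the send_keys call in the answer implies.

```python
from datetime import date


def format_b3_date(day, month, year):
    """Return the date as zero-padded DDMMYYYY digits, e.g. 03092018.

    Building a datetime.date first means an impossible date such as
    31/02/2018 raises ValueError instead of being typed into the form.
    """
    d = date(year, month, day)
    return "{:02d}{:02d}{:04d}".format(d.day, d.month, d.year)


if __name__ == "__main__":
    print(format_b3_date(3, 9, 2018))
```

The returned string could then be passed to `date.send_keys(...)` after `date.clear()`, in place of the separate day, month and year strings.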


Update for the additional query: add these imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

Add this line in place of the time.sleep wait:

WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR,'#divContainerIframeBmf > form > div > div > div:nth-child(1) > div:nth-child(3) > div > div > p')))
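An explicit wait like the one above keeps checking a condition until it holds or a timeout expires, instead of always pausing for a fixed number of seconds. The plain-Python sketch below illustrates that polling idea only; `wait_until` is a hypothetical helper written for this example and is not Selenium's implementation.

```python
import time


def wait_until(condition, timeout=30.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)


if __name__ == "__main__":
    attempts = []

    def table_ready():
        # Stand-in for "the element is present": succeeds on the third check.
        attempts.append(1)
        return len(attempts) >= 3

    wait_until(table_ready, timeout=5, poll_interval=0.1)
    print("ready after", len(attempts), "checks")
```

The practical difference from time.sleep(50) is that the script continues as soon as the condition is met, and fails with a clear error if it never is.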
