
Using Selenium in Python to get data from a dynamic website: how to discover how the database queries are made?

I have some coding experience, but not specifically with web applications. I have been tasked with getting data from this website: http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-de-derivativos/precos-referenciais/taxas-referenciais-bm-fbovespa/

The data are available on a day-to-day basis. I have used Selenium in Python, and so far the results are good: I can get the entire table, store it in a pandas DataFrame, and then write it to a MySQL database. The problem is that the result from the website is always the same!

Here is my code:

from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
import time

def GetDataFromWeb(day, month, year):
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    options.add_argument('window-size=1920x1080')
    # had to use these two below because of webdriver crashing issues
    options.add_argument('no-sandbox')
    options.add_argument('disable-dev-shm-usage')

    driver = webdriver.Chrome(chrome_options=options)

    driver.get("http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-de-derivativos/precos-referenciais/taxas-referenciais-bm-fbovespa/")

    # the table is in an iframe
    iframe = driver.find_element_by_id("bvmf_iframe")
    driver.switch_to.default_content()
    driver.switch_to.frame(iframe)

    # getting to the place where I should input the data
    date = driver.find_element_by_id("Data")
    date.send_keys("/".join((str(day), str(month), str(year))))
    driver.find_element_by_tag_name("button").click()

    # I have put this wait just to be sure it doesn't try to get info from an unloaded page
    time.sleep(5)

    page = bs(driver.page_source, "html.parser")

    table = page.find(id='tb_principal1')

    headers = ['Dias Corridos', '252', '360']

    matrix = []
    for rows in table.select('tr')[2:]:
        values = []
        for columns in rows.select('td'):
            values.append(columns.text.replace(',', '.'))
        matrix.append(values)

    df = pd.DataFrame(data=matrix, columns=headers)

    driver.close()

    # only the first 2 columns are interesting for my purposes
    return df.iloc[:, 0:2]

The table returned by this function is always the same, no matter what inputs I pass in, and the values seem to be those for 06/09/2018 (day=06, month=09). I think the main problem is that I don't know how the queries to their database are made, so this always runs with something like a "default date". I have read about Ajax and JavaScript requests, but I don't know if that is the case here. How can I tell?

This code will work (I updated a few lines of your code):

from selenium import webdriver
from bs4 import BeautifulSoup as bs
import time
import pandas as pd

def GetDataFromWeb(day, month, year):

    # to avoid a data error in the date handler: zero-pad day and month
    if month < 10:
        month = "0" + str(month)
    if day < 10:
        day = "0" + str(day)

    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    options.add_argument('window-size=1920x1080')
    # had to use these two below because of webdriver crashing issues
    options.add_argument('no-sandbox')
    options.add_argument('disable-dev-shm-usage')

    driver = webdriver.Chrome(chrome_options=options)

    driver.get("http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-de-derivativos/precos-referenciais/taxas-referenciais-bm-fbovespa/")

    # the table is in an iframe
    iframe = driver.find_element_by_id("bvmf_iframe")
    driver.switch_to.default_content()
    driver.switch_to.frame(iframe)

    # getting to the place where I should input the data
    date = driver.find_element_by_id("Data")
    date.clear()  # to clear the auto-populated date
    # removed the join part: the three strings are typed one after another, with no "/"
    date.send_keys((str(day), str(month), str(year)))

    driver.find_element_by_tag_name("button").click()

    # I have put this wait just to be sure it doesn't try to get info from an unloaded page
    time.sleep(50)

    page = bs(driver.page_source, "html.parser")

    table = page.find(id='tb_principal1')

    headers = ['Dias Corridos', '252', '360']

    matrix = []
    for rows in table.select('tr')[2:]:
        values = []
        for columns in rows.select('td'):
            values.append(columns.text.replace(',', '.'))
        matrix.append(values)

    df = pd.DataFrame(data=matrix, columns=headers)

    driver.close()

    # only the first 2 columns are interesting for my purposes
    return df.iloc[:, 0:2]

print(GetDataFromWeb(3, 9, 2018))

It will print the matching data for the requested date.

I added the following to avoid a data error in the date handler:

if month < 10:
    month = "0" + str(month)
if day < 10:
    day = "0" + str(day)

and changed how the date is typed into the field:

date.clear()  # clear the auto-populated date
date.send_keys((str(day), str(month), str(year)))  # removed the join part
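The manual zero-padding can also be written with a date format string. The following is a minimal sketch, assuming the field accepts the digits in day, month, year order as in the code above; `format_date_digits` and its `sep` parameter are hypothetical names, not part of the original code.

```python
from datetime import date


def format_date_digits(day, month, year, sep=""):
    """Return the date as zero-padded DD, MM, YYYY joined by `sep`.

    Building a datetime.date first also rejects impossible dates
    (for example day=31, month=2) with a ValueError.
    """
    d = date(year, month, day)
    return d.strftime("%d" + sep + "%m" + sep + "%Y")


print(format_date_digits(3, 9, 2018))        # 03092018
print(format_date_digits(3, 9, 2018, "/"))   # 03/09/2018
```

With this helper the typing step would become `date.send_keys(format_date_digits(day, month, year))`, and the two `if` blocks would no longer be needed.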

Note that the problem in your code was that the day and month fields take two-digit numbers, and the line date.send_keys("/".join((str(day), str(month), str(year)))) was generating an error. Because of that error the system date was picked, so you always saw the same data for any input. Also, when you click on the date field it picks a default date, so we first have to clear it and then send the custom date. Hope this helps.
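One way to check whether the field really holds the date you typed, rather than a default, is to read its value back before clicking the button. The sketch below assumes only that the element offers `clear()`, `send_keys()` and `get_attribute("value")`, as Selenium's WebElement does. `set_and_verify` and the `FakeInput` stand-in are hypothetical and exist only so the example runs without a browser. Comparing digits only is an assumption that any input mask merely inserts separators.

```python
import re


def set_and_verify(element, digits):
    """Clear `element`, type `digits`, and confirm the field kept them."""
    element.clear()
    element.send_keys(digits)
    shown = element.get_attribute("value") or ""
    # Ignore separators that an input mask may have inserted.
    if re.sub(r"\D", "", shown) != digits:
        raise ValueError("field shows %r, expected digits %r" % (shown, digits))
    return shown


class FakeInput(object):
    """Stand-in for a masked date input, used only for this demo."""

    def __init__(self, value):
        self.value = value

    def clear(self):
        self.value = ""

    def send_keys(self, text):
        # Mimic a dd/mm/yyyy mask: append digits, then re-insert slashes.
        raw = re.sub(r"\D", "", self.value + text)[:8]
        parts = [raw[0:2], raw[2:4], raw[4:8]]
        self.value = "/".join(p for p in parts if p)

    def get_attribute(self, name):
        return self.value if name == "value" else None


field = FakeInput("06/09/2018")            # pre-populated default date
print(set_and_verify(field, "03092018"))   # 03/09/2018
```

In the real script the call would take the element found with the id "Data". A ValueError at this point would indicate that the page kept some other date than the one requested.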


Update for the additional query: add these imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

Then add this line in place of the time.sleep wait:

WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR,'#divContainerIframeBmf > form > div > div > div:nth-child(1) > div:nth-child(3) > div > div > p')))
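An explicit wait returns as soon as its condition holds instead of always sleeping for a fixed time. The plain-Python sketch below illustrates that polling idea. It is not Selenium's implementation, and `wait_until`, `timeout`, `poll` and `table_ready` are hypothetical names chosen for the illustration.

```python
import time


def wait_until(condition, timeout=30.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises RuntimeError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise RuntimeError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Demo: the "table" becomes available on the third check.
checks = []


def table_ready():
    checks.append(1)
    return "table" if len(checks) >= 3 else None


print(wait_until(table_ready, timeout=5, poll=0.01))  # table
```

The demo finishes after three quick checks, whereas a fixed `time.sleep(50)` would always cost the full 50 seconds even when the page is ready sooner.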
