Scraping with drop down menu + button using Python

I'm trying to scrape data from Mexico's Central Bank website but have hit a wall. In terms of actions, I need to first access a link within an initial URL. Once that link has been accessed, I need to select two dropdown values and then activate a submit button. If all goes well, I will be taken to a new URL where a set of links to PDFs is available.

The original URL is:

" http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html " http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html

The nested URL (the one with the drop-down menus) is: http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX

The inputs (arbitrary) are, say: '07/03/2019' and '14/03/2019'.

Using BeautifulSoup and requests, I feel like I got as far as filling in the drop-down values, but I failed to click the button and reach the final URL with the list of links.

My code follows below:

from bs4 import BeautifulSoup
import requests

# Grab the last link on the initial page; it points to the nested page with the drop-downs
pagem = requests.get("http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html")
soupm = BeautifulSoup(pagem.content, "lxml")
lst = soupm.find_all('a', href=True)
url = lst[-1]['href']

# Load the nested page and read the options of the two date drop-downs
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
xin = soup.find("select", {"id": "_id0:selectOneFechaIni"})
xfn = soup.find("select", {"id": "_id0:selectOneFechaFin"})
ino = list(xin.stripped_strings)
fino = list(xfn.stripped_strings)

# Try to submit the form with the chosen dates
headers = {'Referer': url}
data = {'_id0:selectOneFechaIni': '07/03/2019', '_id0:selectOneFechaFin': '14/03/2019', "_id0:accion": "_id0:accion"}
respo = requests.post(url, data, headers=headers)
print(respo.url)

In the code, respo.url is equal to url, so the code fails. Can anybody please help me identify where the problem is? I'm a newbie to scraping, so the answer might be obvious; apologies in advance for that. I'd appreciate any help. Thanks!

Last time I checked, you cannot submit a form by clicking buttons with BeautifulSoup and Python. There are typically two approaches I often see:

  1. Reverse engineer the form

If the form makes AJAX calls (e.g. it makes a request behind the scenes, common for SPAs written in React or Angular), then the best approach is to use the network requests tab in Chrome or another browser to understand what the endpoint is and what the payload is. Once you have those answers, you can make a POST request with the requests library to that endpoint with data=your_payload_dictionary (i.e. manually do what the form is doing behind the scenes); a minimal sketch of this approach follows this list. Read this post for a more elaborate tutorial.

  2. Use a headless browser

If the website is written in something like ASP.NET or a similar MVC framework, then the best approach is to use a headless browser to fill out the form and click submit. A popular framework for this is Selenium. This simulates a normal browser. Read this post for a more elaborate tutorial.
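As a rough illustration of approach #1, the sketch below posts a form payload directly to its endpoint with requests. The endpoint URL and field names here are hypothetical placeholders; they would be replaced with whatever the browser's network tab shows for the real form.

import requests

# Hypothetical endpoint and field names -- replace with the ones observed
# in the browser's network tab for the actual form
endpoint = "https://example.com/some/form/endpoint"
payload = {
    "startDate": "07/03/2019",
    "endDate": "14/03/2019",
}
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://example.com/page-with-the-form",
}

session = requests.Session()  # keeps any cookies the site sets between requests
response = session.post(endpoint, data=payload, headers=headers)
print(response.status_code)
print(response.text[:500])  # peek at the start of the reply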

Judging by a cursory look at the page you're working on, I recommend approach #2.

The page you have to scrape is:

http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces

Add the date to consult and the JSESSIONID from the cookies to the payload, and the Referer, User-Agent and all the usual good stuff to the request headers.

Example:

import requests
import pandas as pd

cl = requests.Session()
url = "http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces"

# The JSESSIONID value is session-specific: it is the cookie handed out when the
# LeePeriodoSectorizacionValores.faces page is first opened in the browser
payload = {
    "JSESSIONID": "cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000",
    "fechaAConsultar": "21/03/2019"
}

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded",
    "Referer": "http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000"
}

response = cl.post(url, data=payload, headers=headers)
tables = pd.read_html(response.text)  # parse every HTML table in the response into DataFrames
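pd.read_html returns a list of DataFrames, one per HTML table found in the response, so a quick way to check what came back is something along these lines (which index holds the sectorization data is an assumption to verify by inspection):

# Assumes the table of interest is among those returned; verify the index by inspection
for i, t in enumerate(tables):
    print(i, t.shape)
print(tables[0].head())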

When just clicking through the pages, it looks like there's some sort of cookie/session handling going on that might be difficult to take into account when using requests.

(Example: http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=8AkD5D0IDxiiwQzX6KqkB2WIYRjIQb2TIERO1lbP35ClUgzmBNkc!-1120047000 )

It might be easier to code this up using selenium, since that will automate the browser (and take care of all the headers and whatnot). You'll still have access to the HTML to be able to scrape what you need, and you can probably reuse a lot of what you're already doing in selenium. A sketch of that approach follows.
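For instance, a minimal sketch along those lines, assuming Selenium with a Chrome driver is installed: the select IDs and dates are taken from the question's code, while locating the submit button by the name _id0:accion is an assumption that may need adjusting against the live page.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get("http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces"
           "?BMXC_claseIns=GUB&BMXC_lang=es_MX")

# Pick the start and end dates in the two drop-downs (IDs taken from the question);
# assumes the visible option text matches the dd/mm/yyyy strings used there
Select(driver.find_element(By.ID, "_id0:selectOneFechaIni")).select_by_visible_text("07/03/2019")
Select(driver.find_element(By.ID, "_id0:selectOneFechaFin")).select_by_visible_text("14/03/2019")

# Submit the form; locating the button by name "_id0:accion" is an assumption
driver.find_element(By.NAME, "_id0:accion").click()

# The result page's links (e.g. to the PDFs) can now be collected
for a in driver.find_elements(By.TAG_NAME, "a"):
    print(a.get_attribute("href"))

driver.quit()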
