Not able to scrape data from dropdown box
On the website http://www.msamb.com/apmcpri_rpt.aspx, the output changes every time I click an element in the dropdown, but the URL remains the same. The page calls JavaScript when the dropdown value changes. I tracked the network traffic, checked the request headers and form key-values, and reused them in Postman. But it returns the same page every time (http://www.msamb.com/apmcpri_rpt.aspx with nothing selected in the dropdown).
Can someone please help with scraping this site?
A POST request is sent each time you select an item from the dropdown. Simulate it in your code. The requests package helps maintain your web-scraping session. Sample code:
from bs4 import BeautifulSoup
import requests

apmc = 'JALGAON'
url = 'http://www.msamb.com/apmcpri_rpt.aspx'

with requests.Session() as session:
    # headers that mimic the browser's AJAX request
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
        'X-Requested-With': 'XMLHttpRequest'
    })

    # load the page once to get the dropdown options and hidden form fields
    response = session.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # build an options mapping (visible text -> option value), skipping the placeholder entry
    options = {option.get_text(strip=True): option['value']
               for option in soup.select("select#cpMainContent_cmb_comm option")[1:]}

    # parse form parameters
    form = soup.find("form", id="form1")
    params = {
        'ctl00$cpMainContent$cmb_comm': options.get(apmc),
        '__ASYNCPOST': 'true',
        'ctl00$cpMainContent$ScriptManager1': 'ctl00$cpMainContent$UpdatePanel1|ctl00$cpMainContent$cmb_comm',
        '__EVENTTARGET': 'ctl00$cpMainContent$cmb_comm',
        '__EVENTARGUMENT': form.find('input', {'name': '__EVENTARGUMENT'})['value'],
        '__LASTFOCUS': '',
        '__VIEWSTATE': form.find('input', {'name': '__VIEWSTATE'})['value'],
        '__VIEWSTATEGENERATOR': form.find('input', {'name': '__VIEWSTATEGENERATOR'})['value'],
        '__VIEWSTATEENCRYPTED': '',
        '__EVENTVALIDATION': form.find('input', {'name': '__EVENTVALIDATION'})['value']
    }

    # simulate selecting the dropdown item
    response = session.post(url, data=params)

    # parse the results: print the second column of each row, skipping the header row
    soup = BeautifulSoup(response.content, 'html.parser')
    for row in soup.select("table#cpMainContent_GridView1_tab5 tr")[1:]:
        print(row.find_all("td")[1].text)
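
The sample above copies each hidden field by name. As a rough sketch of a more general approach, the hidden inputs can be collected in one pass, so the payload still works if the page adds or renames a hidden field. This version uses only the standard library's html.parser rather than BeautifulSoup. The names HiddenInputCollector and build_postback, and the sample HTML, are made up for illustration and are not taken from the site.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attributes = dict(attrs)
        input_type = (attributes.get("type") or "").lower()
        if input_type == "hidden" and attributes.get("name"):
            # A missing value attribute is submitted as an empty string.
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_postback(html, target, selected_value):
    """Return form data for a postback triggered by the control `target`."""
    collector = HiddenInputCollector()
    collector.feed(html)
    data = dict(collector.fields)
    data["__EVENTTARGET"] = target   # the control that "fired" the event
    data[target] = selected_value    # the chosen dropdown value
    return data


# Made-up page fragment, for illustration only.
sample_html = """
<form id="form1">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="hidden" name="__EVENTARGUMENT" />
  <input type="text" name="search" value="ignored" />
</form>
"""

payload = build_postback(sample_html, "ctl00$cpMainContent$cmb_comm", "42")
print(payload)
```

With a real page, the HTML would come from response.text of the initial GET. Extra keys such as __ASYNCPOST and the ScriptManager entry would still need to be added to the returned dictionary before posting.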