I am having a difficult time scraping the contents of a web page.
To explain, here is my Python code:
response = requests.post('http://a836-acris.nyc.gov/bblsearch/bblsearch.asp?borough=1&block=733&lot=66',{'User-Agent' : 'Mozilla/5.0'})
This gives me an HTML page that contains only a form, not the final page:
<html xmlns="http://www.w3.org/1999/xhtml" >
<head>
<title>Untitled Page</title>
</head>
<body>
<form name="bbldata" action="https://a836-acris.nyc.gov/DS/DocumentSearch/BBLResult" method="post">
<input type="hidden" name="hid_borough" value="1"/>
<input type="hidden" name="hid_borough_name" value="MANHATTAN / NEW YORK" />
<input type="hidden" name="hid_block" value="733"/>
<input type="hidden" name="hid_block_value" value="733"/>
<input type="hidden" name="hid_lot" value="66"/>
<input type="hidden" name="hid_lot_value" value="66"/>
<input type="hidden" name="hid_unit" value=""/>
<input type="hidden" name="hid_selectdate" value=""/>
<INPUT TYPE="HIDDEN" NAME="hid_datefromm" VALUE="">
<INPUT TYPE="HIDDEN" NAME="hid_datefromd" VALUE="">
<INPUT TYPE="HIDDEN" NAME="hid_datefromy" VALUE="">
<INPUT TYPE="HIDDEN" NAME="hid_datetom" VALUE="">
<INPUT TYPE="HIDDEN" NAME="hid_datetod" VALUE="">
<INPUT TYPE="HIDDEN" NAME="hid_datetoy" VALUE="">
<input type="hidden" name="hid_doctype" value=""/>
<input type="hidden" name="hid_doctype_name" value="All Document Classes"/>
<input type="hidden" name="hid_max_rows" value="10"/>
<input type="hidden" name="hid_page" value="1" />
<input type="hidden" name="hid_ReqID" value=""/>
<input type="hidden" name="hid_SearchType" value="BBL"/>
<input type="hidden" name="hid_ISIntranet" value="N"/>
<input type="hidden" name="hid_sort" value=""/>
</form>
<script language="JavaScript">
document.bbldata.submit();
</script>
</body>
</html>
However, if you enter this URL in a browser, the script in the HTML submits the form and you ultimately land on this web page, which is the one I need to scrape:
Any help will be appreciated!
The HTML form in your example shows the data you need to post. As I think you're aware, the URL you're using is actually the referer. So, you need to:
import requests

# 1. Create a payload
payload = {
'hid_borough': 1,
'hid_borough_name': 'MANHATTAN / NEW YORK',
'hid_block': 733,
'hid_block_value': 733,
'hid_lot': 66,
'hid_lot_value': 66,
'hid_doctype_name': 'All Document Classes',
'hid_max_rows': 10,
'hid_page': 1,
'hid_SearchType': 'BBL',
'hid_ISIntranet': 'N'
}
# 2. Add the correct referer to your headers
header = {'User-Agent': 'Mozilla/5.0',
'referer': 'http://a836-acris.nyc.gov/bblsearch/bblsearch.asp?borough=1&block=733&lot=66'}
# 3. Add payload and headers to the post
redirect = 'https://a836-acris.nyc.gov/DS/DocumentSearch/BBLResult'
result = requests.post(redirect, data=payload, headers=header)
print(result.url)
https://a836-acris.nyc.gov/DS/DocumentSearch/BBLResult
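If you would rather not hard-code the payload, the hidden fields can be read from the intermediate page and posted as they are. The following is only a rough sketch under a few assumptions: HiddenFormParser, extract_form and fetch_results are made-up names for illustration, the first request is sent as a GET (which is what a browser does when you type the URL), and the page contains a single form like the one in the question.

```python
from html.parser import HTMLParser


class HiddenFormParser(HTMLParser):
    """Collect the action URL and the input fields of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so the
        # upper-case <INPUT ...> tags are handled the same way.
        attrs = dict(attrs)
        if tag == 'form' and not self._done:
            self._in_form = True
            self.action = attrs.get('action')
        elif tag == 'input' and self._in_form and attrs.get('name'):
            self.fields[attrs['name']] = attrs.get('value') or ''

    def handle_endtag(self, tag):
        if tag == 'form' and self._in_form:
            self._in_form = False
            self._done = True


def extract_form(html):
    """Return (action_url, {field_name: value}) for the first form in html."""
    parser = HiddenFormParser()
    parser.feed(html)
    return parser.action, parser.fields


def fetch_results(search_url):
    """Load the intermediate page, then submit its form like the browser's script would."""
    import requests  # third-party; imported here so extract_form works without it

    headers = {'User-Agent': 'Mozilla/5.0'}
    with requests.Session() as session:
        first = session.get(search_url, headers=headers)
        action, payload = extract_form(first.text)
        headers['referer'] = search_url
        return session.post(action, data=payload, headers=headers)
```

Calling fetch_results with the bblsearch.asp URL from the question should then return the response for the BBLResult page, provided the site accepts the request. Also note that the second positional argument of requests.post is data, not headers, so headers have to be passed by keyword as in the code above.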