Consider a website similar to this one:
http://a810-bisweb.nyc.gov/bisweb/COsByLocationServlet?requestid=1&allbin=3055311
As one can see, the website contains links to PDF files referenced by an href attribute in the page source, e.g.:
<a href="javascript:$('form_cofo_pdf_view_B000114563.PDF').submit();">B000114563.PDF</a>
I would like to open the underlying file using python, effectively scraping the results.
import urllib2
from bs4 import BeautifulSoup

req = urllib2.Request("link.com")
page = urllib2.urlopen(req)
soup = BeautifulSoup(page, "html.parser")

links = []
for link in soup.findAll('a'):
    links.append(link.get("href"))
Normally I would just join the base URL with the href to get the documents, but here the links are built through JavaScript, so I am not entirely sure how to access the files.
I would prefer to stick with urllib2 and BeautifulSoup rather than switch to Selenium just to click the links. Does anyone have an idea how to accomplish this? It would be greatly appreciated.
I downloaded a few files and compared each direct link with its filename: all the elements required to build the link are contained in the filename itself.
Filename:
form_cofo_pdf_view_B000114563.PDF
Direct link:
http://a810-bisweb.nyc.gov/bisweb/CofoDocumentContentServlet
?passjobnumber=null
&cofomatadata1=cofo
&cofomatadata2=B
&cofomatadata3=000
&cofomatadata4=114000
&cofomatadata5=B000114563.PDF
So you can create the direct link once you extract the filename from the string javascript:$('form_cofo_pdf_view_B000114563.PDF').submit();
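Based on the single sample above, the filename appears to decompose into the query parameters as sketched below. The helper name direct_link and the exact split positions are assumptions inferred from that one example, so they may not hold for every document type on the site:

```python
import re

# URL template assembled from the direct link shown above.
BASE = ("http://a810-bisweb.nyc.gov/bisweb/CofoDocumentContentServlet"
        "?passjobnumber=null"
        "&cofomatadata1=cofo"
        "&cofomatadata2={boro}"
        "&cofomatadata3={block}"
        "&cofomatadata4={thousand}"
        "&cofomatadata5={filename}")

def direct_link(href):
    """Build a direct PDF URL from a javascript:...submit() href.

    The way the filename is split into parameters is inferred from a
    single sample (B000114563.PDF) and may not generalize.
    """
    m = re.search(r"form_cofo_pdf_view_(\w+\.PDF)", href)
    if m is None:
        return None
    filename = m.group(1)          # e.g. "B000114563.PDF"
    boro = filename[0]             # "B"      -> cofomatadata2
    block = filename[1:4]          # "000"    -> cofomatadata3
    number = filename[4:10]        # "114563"
    thousand = number[:3] + "000"  # "114000" -> cofomatadata4 (rounded down to the thousand)
    return BASE.format(boro=boro, block=block,
                       thousand=thousand, filename=filename)
```

With the direct link in hand, each file can then be fetched with a plain urllib2.urlopen(...).read() call, as in the question's own snippet, with no need for Selenium.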
Working code: http://pastebin.com/kt72GSyYa