Python: Get javascript file from href tag of html

Question

Consider a website similar to this one:

http://a810-bisweb.nyc.gov/bisweb/COsByLocationServlet?requestid=1&allbin=3055311

As one can see, the website contains links to PDF files referenced by an href attribute in the page source, e.g.:

<a href="javascript:$('form_cofo_pdf_view_B000114563.PDF').submit();">B000114563.PDF</a>

I would like to open the underlying file using Python, effectively scraping the results.

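import urllib2
from bs4 import BeautifulSoup

req = urllib2.Request("http://link.com")  # placeholder URL
page = urllib2.urlopen(req)
soup = BeautifulSoup(page, "html.parser")

# Collect the href attribute of every anchor tag
links = []
for link in soup.findAll('a'):
    links.append(link.get("href"))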

Normally I would just join the base URL with the href value to get the documents, but here the href contains a JavaScript call instead of a URL. Hence I am not entirely sure how to access the files.

I would prefer to use urllib2 and BeautifulSoup and not switch to Selenium to click on links. Does anyone have an idea how to accomplish that? It would be greatly appreciated.

Answer 1

I downloaded a few files and compared each direct link with its filename. Every element required to build the link is already present in the filename.

Filename:

form_cofo_pdf_view_B000114563.PDF

Direct link:

http://a810-bisweb.nyc.gov/bisweb/CofoDocumentContentServlet
?passjobnumber=null
&cofomatadata1=cofo
&cofomatadata2=B
&cofomatadata3=000
&cofomatadata4=114000
&cofomatadata5=B000114563.PDF

So you can build the direct link once you extract the filename from the string javascript:$('form_cofo_pdf_view_B000114563.PDF').submit();

Working code: http://pastebin.com/kt72GSyYa
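
The idea can also be sketched in a few lines. This is not the pastebin code, only a minimal illustration. It assumes the pattern inferred from the single example above holds for every file: cofomatadata2 is the first character of the filename, cofomatadata3 the next three characters, and cofomatadata4 the following three characters padded with "000". The names filename_from_href and direct_link are hypothetical helpers, and the slicing rule should be checked against a few more files before relying on it.

```python
import re

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2, as used in the question

BASE_URL = "http://a810-bisweb.nyc.gov/bisweb/CofoDocumentContentServlet"

# Matches the filename inside javascript:$('form_cofo_pdf_view_<NAME>.PDF').submit();
HREF_PATTERN = re.compile(r"form_cofo_pdf_view_([A-Za-z0-9]+\.PDF)")


def filename_from_href(href):
    """Return the PDF filename embedded in a javascript: href, or None."""
    match = HREF_PATTERN.search(href or "")
    return match.group(1) if match else None


def direct_link(filename):
    """Build the servlet URL from a filename such as 'B000114563.PDF'.

    Assumption: the slicing below generalizes from the one example in
    the answer; it has not been verified against other filenames.
    """
    params = [
        ("passjobnumber", "null"),
        ("cofomatadata1", "cofo"),
        ("cofomatadata2", filename[0]),            # 'B'
        ("cofomatadata3", filename[1:4]),          # '000'
        ("cofomatadata4", filename[4:7] + "000"),  # '114' -> '114000'
        ("cofomatadata5", filename),
    ]
    return BASE_URL + "?" + urlencode(params)
```

With the links list from the question, each entry can be passed through filename_from_href, and every non-None result through direct_link. The resulting URL can then be opened with urllib2.urlopen like any other link.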
