
Python: Get javascript file from href tag of html

Consider a website similar to this one:

http://a810-bisweb.nyc.gov/bisweb/COsByLocationServlet?requestid=1&allbin=3055311

As one can see, the website contains links to PDF files, each referenced by an href attribute in the page source, e.g.:

<a href="javascript:$('form_cofo_pdf_view_B000114563.PDF').submit();">B000114563.PDF</a>

I would like to open the underlying file using Python, effectively scraping the results.

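import urllib2
from bs4 import BeautifulSoup

# Fetch the page and parse it.
req = urllib2.Request("http://a810-bisweb.nyc.gov/bisweb/COsByLocationServlet?requestid=1&allbin=3055311")
page = urllib2.urlopen(req)
soup = BeautifulSoup(page, "html.parser")

# Collect the href of every anchor tag.
links = []
for link in soup.findAll('a'):
    links.append(link.get("href"))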

Normally I would just connect the base url with the href url to get the documents, but here, they are referenced with javascript. Hence I am not entirely sure how to access the files.

I would prefer to use urllib2 and BeautifulSoup rather than switch to Selenium to click on the links. Does anyone have an idea how to accomplish that? It would be greatly appreciated.

I downloaded a few files and compared each direct link with its file name. Every element the link requires is already present in the file name.

Filename:

form_cofo_pdf_view_B000114563.PDF

Direct link:

http://a810-bisweb.nyc.gov/bisweb/CofoDocumentContentServlet
?passjobnumber=null
&cofomatadata1=cofo
&cofomatadata2=B
&cofomatadata3=000
&cofomatadata4=114000
&cofomatadata5=B000114563.PDF

So you can build the direct link once you extract the file name from the string javascript:$('form_cofo_pdf_view_B000114563.PDF').submit();

Working code: http://pastebin.com/kt72GSyYa
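
The construction described above can be sketched roughly as follows. This is not the code from the pastebin link; it is a minimal illustration based only on the single example shown here. The function name direct_link_from_href and the regular expression are hypothetical, and the way the file name is sliced into cofomatadata2 through cofomatadata4 (leading letter, next three digits, next three digits padded with "000") is an assumption inferred from that one example, so it may not hold for every file.

```python
import re

try:
    from urllib import urlencode  # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

BASE_URL = "http://a810-bisweb.nyc.gov/bisweb/CofoDocumentContentServlet"

# Finds the form name inside the javascript: href and captures the file name.
HREF_PATTERN = re.compile(r"form_cofo_pdf_view_([^']+)'")


def direct_link_from_href(href):
    """Return the direct PDF URL for a javascript: href, or None if it does not match."""
    match = HREF_PATTERN.search(href or "")
    if match is None:
        return None
    filename = match.group(1)  # e.g. "B000114563.PDF"
    params = [
        ("passjobnumber", "null"),
        ("cofomatadata1", "cofo"),
        ("cofomatadata2", filename[0]),            # assumed: leading letter
        ("cofomatadata3", filename[1:4]),          # assumed: next three digits
        ("cofomatadata4", filename[4:7] + "000"),  # assumed: next three digits padded with zeros
        ("cofomatadata5", filename),
    ]
    return BASE_URL + "?" + urlencode(params)


href = "javascript:$('form_cofo_pdf_view_B000114563.PDF').submit();"
print(direct_link_from_href(href))
```

Each href collected from the page could be passed through this function, skipping the ones for which it returns None, and the resulting URL opened with urllib2.urlopen in the usual way.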
