How to get href from <a> tag which contains JavaScript using Python?

Question

I am trying to get the href from an <a> tag using Python and Selenium, but the href contains JavaScript, so I am unable to get the target URL.

I am using Python 3.7.3 and Selenium 3.141.0.

HTML:

<a href="javascript:GoPDF('FS1546')" style="TEXT-DECORATION: Underline">Aberdeen Standard Wholesale Australian Fixed Income</a>

Code:

from selenium import webdriver
driver = webdriver.Chrome("chromedriver.exe")
driver.get("http://www.colonialfirststate.com.au/Price_performance/performanceNPrice.aspx?menutabtype=performance&CompanyCode=001&Public=1&MainGroup=IF&BrandName=FC&ProductIDs=91&Product=FirstChoice+Wholesale+Investments&ACCodes=&ACText=&SearchType=Performance&Multi=False&Hedge=False&IvstType=Investment+products&IvstGroup=&APIR=&FundIDs=&FundName=&FundNames=&SearchProdIDs=&Redirect=1")
print(driver.find_element_by_xpath("//tbody/tr[5]/td[1]/a").get_attribute("href"))

What I need is the target URL:

https://www3.colonialfirststate.com.au/content/dam/prospects/fs/1/5/fs1546.pdf?3

But it gives me:

javascript:GoPDF('FS2311')

Answer 1

I checked the PDF URL from the popup and worked out how the site generates it.

They use the file name (e.g. FS2065) to generate the PDF URL.

The URL of the PDF looks like this: https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/0/fs2065.pdf?3

All PDF URLs share the same path up to this point:

https://www3.colonialfirststate.com.au/content/dam/prospects/

After that, the rest of the path is generated from the file ID:

fs/2/0/fs2065.pdf?3
 | | |     |     ||
 | | |     |     ++--- Not needed (But you can keep if you want)
 | | |     |
 | | |     +---- File Name
 | | +---------- 4th character in the file name 
 | +------------ 3rd character in the file name 
 +-------------- First two characters in the file name

We can use this as a workaround to build the exact URL.

url = "javascript:GoPDF('FS2311')"  # the javascript: href read from the <a> tag

# Skip the 18-character prefix "javascript:GoPDF('" and the trailing "')"
pdfFileId = url[18:-2].lower()

# Fill in: first two characters / 3rd character / 4th character / file name
pdfBaseUrl = "https://www3.colonialfirststate.com.au/content/dam/prospects/%s/%s/%s/%s.pdf?3" % (pdfFileId[:2], pdfFileId[2], pdfFileId[3], pdfFileId)

print(pdfBaseUrl)
# prints https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3
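If the href is not guaranteed to have exactly that shape, the fixed offsets in the slice can silently produce a wrong ID. Here is a minimal sketch of the same workaround using a regular expression instead. The names BASE, GOPDF_RE and pdf_url_from_href are made up for illustration, and the path layout is only the one observed above, not a documented rule of the site. The optional "?3" suffix is left off.

```python
import re

# Assumption: base path and layout as observed in the popup URL above.
BASE = "https://www3.colonialfirststate.com.au/content/dam/prospects"

# Matches GoPDF('<file id>') anywhere in the href and captures the file ID.
GOPDF_RE = re.compile(r"GoPDF\(\s*'([^']+)'\s*\)")


def pdf_url_from_href(href):
    """Return the PDF URL for a javascript:GoPDF('...') href, or None."""
    match = GOPDF_RE.search(href)
    if match is None:
        return None  # not a GoPDF link
    file_id = match.group(1).lower()
    if len(file_id) < 4:
        return None  # too short to fill the prefix / 3rd / 4th character slots
    return "%s/%s/%s/%s/%s.pdf" % (BASE, file_id[:2], file_id[2], file_id[3], file_id)


print(pdf_url_from_href("javascript:GoPDF('FS2311')"))
```

With Selenium you would pass in whatever element.get_attribute("href") returns for each link, and skip the links for which the function returns None.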


Answer 2

Kudos to the accepted answer for doing the background work.

I'd recommend using the urllib.parse facilities from the standard library. URLs are not as straightforward as they first appear, and the people who wrote urllib know the URL standards (such as RFC 3986) well.
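As a quick illustration of what urllib.parse does for you, this small sketch splits the example PDF URL from the accepted answer into its components. Note that the query string ("3") comes back separately from the path, so nothing has to be assumed about its length.

```python
from urllib.parse import urlparse

# Example PDF URL taken from the accepted answer.
parts = urlparse("https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/0/fs2065.pdf?3")

print(parts.scheme)  # https
print(parts.netloc)  # www3.colonialfirststate.com.au
print(parts.path)    # /content/dam/prospects/fs/2/0/fs2065.pdf
print(parts.query)   # 3
```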

The following code looks more complicated at first sight but:

it delegates the tricky (and potentially brittle) URL handling to urllib. It is therefore unlikely to suffer from false assumptions about the structure of URLs (e.g. that the query string is one digit long, as in the accepted answer).
it uses more robust ways to extract the GoPDF file ID and the invariant part of the URL. String slicing based on character position is likely to break as soon as any small detail changes.

from urllib.parse import urlparse, urlunparse


def build_pdf_url(model_url, js_href):
    url = urlparse(model_url)
    pdf_fileid = get_fileid_from_js_href(js_href)
    pdf_path = build_pdf_path(model_url, pdf_fileid)
    return urlunparse((url.scheme, url.netloc, pdf_path, url.params,
                      url.query, url.fragment))


def get_fileid_from_js_href(href):
    """extract fileid by extracting text between single quotes"""
    return href.split("'")[1].lower()


def build_pdf_path(url, pdf_fileid):
    prefix = pdf_fileid[:2]
    major_version = pdf_fileid[2]
    minor_version = pdf_fileid[3]
    filename = pdf_fileid + '.pdf'
    return '/'.join([invariant_path(url), prefix, major_version, minor_version, filename])


def invariant_path(url, dropped_components=4):
    """
    return all but the dropped components of the URL 'path'
    NOTE: path components are separated by '/'
    """
    path_components = urlparse(url).path.split('/')
    return '/'.join(path_components[:-dropped_components])


js_href = "javascript:GoPDF('FS1546')"
model_url = "https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3"
print(build_pdf_url(model_url, js_href))


$ python urlbuild.py
https://www3.colonialfirststate.com.au/content/dam/prospects/fs/1/5/fs1546.pdf?3
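If you prefer something shorter, the object returned by urlparse is a named tuple, so the path can be swapped with _replace and the URL rebuilt with geturl() instead of passing all six components to urlunparse. This is only a sketch of that variation: swap_pdf_path is a made-up name, and it makes the same assumption as above that the last four path components are the ones derived from the file ID.

```python
from urllib.parse import urlparse


def swap_pdf_path(model_url, file_id, dropped_components=4):
    """Rebuild model_url with the path components for another file ID."""
    parts = urlparse(model_url)
    # Keep the invariant leading part of the path, e.g. /content/dam/prospects
    kept = parts.path.split("/")[:-dropped_components]
    new_path = "/".join(kept + [file_id[:2], file_id[2], file_id[3], file_id + ".pdf"])
    # _replace returns a copy with only the path changed; scheme, host and
    # query string are carried over from the model URL untouched.
    return parts._replace(path=new_path).geturl()


model_url = "https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3"
print(swap_pdf_path(model_url, "fs1546"))
```

The file ID is expected in lower case here, as returned by get_fileid_from_js_href above.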


2 answers

solution1
3 ACCEPTED 2019-09-11 07:03:27

solution2
0 2019-09-12 02:41:17
