
How can I extract a redirected URL with Python, without using the requests module and via XPath?

The HTML page includes the following script:

<script>
const url = 'REQUIRED LINK';
window.location.href = url + window.location.search;
</script>

This is the only place on the page where the link appears. I don't know JavaScript at all.
I tried to extract it this way:

import requests
from lxml import html
page_2 = requests.get(link).content.decode('UTF-8')
html_tree = html.fromstring(page_2)

inside_scripts = html_tree.xpath("//script[contains(@text, 'url')]")

But it returns an empty list.

Let's suppose const url = 'REQUIRED LINK'; always uses the same formatting, including spaces.

You could then use a regular expression to extract 'REQUIRED LINK':

JavaScript:

const regex = /(?<=const url = ').+(?=';)/gm;

var required_link = YOUR_HTML_STRING.match(regex);

Python:

import re

regex = r"(?<=const url = ').+(?=';)"

required_link = re.findall(regex, HTML_STRING)[0]
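
The lookaround pattern above depends on the exact spacing and on single quotes. As a minimal sketch of a slightly more tolerant variant, the function below uses a capture group instead and returns None when nothing matches. The sample page, the example.com URL, and the name extract_redirect_url are assumptions made for illustration, not part of the original post.

```python
import re

# Hypothetical sample page; the URL is a placeholder, not from the original post.
SAMPLE_HTML = """
<html><head>
<script>
const url = 'https://example.com/target';
window.location.href = url + window.location.search;
</script>
</head></html>
"""

# Tolerates extra whitespace and either quote style around the value.
# Group 1 is the opening quote, group 2 is the URL, \1 requires the same closing quote.
URL_PATTERN = re.compile(r"""const\s+url\s*=\s*(['"])(.+?)\1\s*;""")


def extract_redirect_url(html_text):
    """Return the value assigned to `const url`, or None if it is not found."""
    match = URL_PATTERN.search(html_text)
    return match.group(2) if match else None


print(extract_redirect_url(SAMPLE_HTML))
```

Unlike indexing re.findall(...)[0], this does not raise an IndexError on pages that lack the script; the caller checks for None instead.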

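In XPath, @text refers to an attribute named text, which the script element does not have. To test the element's text content, use . instead: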

inside_scripts = html_tree.xpath("//script[contains(., 'url')]")

A one-liner to extract it with XPath 1.0:

print(html_tree.xpath('substring-after(substring-before(//script[contains(.,"const url")],"\';"),"= \'")'))

Output: REQUIRED LINK
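
To see the XPath variants side by side, here is a small self-contained sketch. It assumes the lxml package is installed and parses an inline sample page instead of fetching one; the page content and the example.com URL are placeholders made up for illustration.

```python
from lxml import html

# Hypothetical sample page; the URL is a placeholder, not from the original post.
SAMPLE_HTML = """
<html><head>
<script>
const url = 'https://example.com/target';
window.location.href = url + window.location.search;
</script>
</head><body></body></html>
"""

tree = html.fromstring(SAMPLE_HTML)

# @text looks for an attribute named "text", which <script> does not have.
by_attribute = tree.xpath("//script[contains(@text, 'url')]")

# "." is the element's string value, i.e. the script source itself.
by_content = tree.xpath("//script[contains(., 'url')]")

# Cut out the text between "= '" and "';" using XPath 1.0 string functions.
# A triple-quoted Python string avoids escaping the quotes inside the expression.
link = tree.xpath(
    """substring-after(substring-before(//script[contains(., "const url")], "';"), "= '")"""
)

print(len(by_attribute), len(by_content), link)
```

With this sample, the attribute-based query should find no elements, the content-based query should find the single script element, and link should hold the placeholder URL. Note that substring-before stops at the first "';" in the script, so this relies on the const url line coming before any other statement that ends the same way.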
