How can I extract redirected url with Python without using requests module and via xpath?

Question

The HTML page includes the following script:

<script>
const url = 'REQUIRED LINK';
window.location.href = url + window.location.search;
</script>

This is the only place on the page where the link appears. I don't know JavaScript at all.
I tried to extract it this way:

import requests
from lxml import html

page_2 = requests.get(link).content.decode('UTF-8')
html_tree = html.fromstring(page_2)

inside_scripts = html_tree.xpath("//script[contains(@text, 'url')]")

But it returns an empty list.

Answer 1

Let's suppose const url = 'REQUIRED LINK'; always uses the same formatting, including spaces.

You could run the following code, which uses a regex, to extract 'REQUIRED LINK':

JavaScript:

const regex = /(?<=const url = ').+?(?=';)/;

const match = YOUR_HTML_STRING.match(regex);
const required_link = match ? match[0] : null;

Python:

import re

regex = r"(?<=const url = ').+?(?=';)"

required_link = re.findall(regex, HTML_STRING)[0]
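
The pattern above only matches when the spacing and quote style are exactly as shown. As a hedged sketch, assuming the page might use different whitespace or double quotes, a more tolerant pattern could look like the following. HTML_STRING here is a hypothetical inline copy of the page, and extract_redirect_url and URL_PATTERN are illustrative names that do not come from the original post.

```python
import re

# Hypothetical sample page, mirroring the script from the question.
HTML_STRING = """
<html><head><script>
const url = 'REQUIRED LINK';
window.location.href = url + window.location.search;
</script></head></html>
"""

# Tolerate variable whitespace and either quote style; the backreference
# (?P=q) requires the closing quote to match the opening one.
URL_PATTERN = re.compile(
    r"const\s+url\s*=\s*(?P<q>['\"])(?P<link>.*?)(?P=q)\s*;"
)


def extract_redirect_url(page_source):
    """Return the string assigned to `const url`, or None if it is absent."""
    match = URL_PATTERN.search(page_source)
    return match.group("link") if match else None


required_link = extract_redirect_url(HTML_STRING)
print(required_link)
```

Returning None instead of indexing with [0] avoids an IndexError when the page does not contain the expected script.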

Answer 2

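You should test the element's string value (.) instead of @text. In XPath, @text refers to an attribute named "text", which the script element does not have: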

inside_scripts = html_tree.xpath("//script[contains(., 'url')]")
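
A minimal sketch comparing the two predicates, assuming lxml is installed. PAGE is a hypothetical inline copy of the page so that no network request is needed, and the variable names are illustrative only.

```python
from lxml import html  # third-party package: pip install lxml

# Hypothetical sample page, mirroring the script from the question.
PAGE = """
<html><head><script>
const url = 'REQUIRED LINK';
window.location.href = url + window.location.search;
</script></head><body></body></html>
"""

html_tree = html.fromstring(PAGE)

# @text looks for an attribute named "text"; <script> has none here,
# so this predicate never matches.
by_attribute = html_tree.xpath("//script[contains(@text, 'url')]")

# "." is the element's string value, i.e. the script source itself.
by_string_value = html_tree.xpath("//script[contains(., 'url')]")

print(len(by_attribute))
print(len(by_string_value))

# The matched element's text holds the JavaScript source, which can
# then be searched for the link.
script_source = by_string_value[0].text
print("REQUIRED LINK" in script_source)
```

The XPath query returns script elements, not the link itself, so the script text still has to be searched with a regex or XPath string functions.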

Answer 3

One-liner to extract it with XPath 1.0 (the expression is wrapped in triple quotes because it contains both single and double quotes):

print(html_tree.xpath("""substring-after(substring-before(//script[contains(.,"const url")],"';"),"= '")"""))

Output: REQUIRED LINK

3 answers

solution1 2 2020-05-15 15:18:13

solution2 1 ACCEPTED 2020-05-15 15:26:57

solution3 0 2020-05-15 15:48:03
