The HTML page includes the following script:
<script>
const url = 'REQUIRED LINK';
window.location.href = url + window.location.search;
</script>
This is the only place on the page where the link appears. I don't know JavaScript at all.
I tried to extract it this way:
import requests
from lxml import html
page_2 = requests.get(link).content.decode('UTF-8')
html_tree = html.fromstring(page_2)
inside_scripts = html_tree.xpath("//script[contains(@text, 'url')]")
But it returns an empty list.
Suppose the line const url = 'REQUIRED LINK'; always uses the same formatting, including spaces.
You can then use a regular expression to extract 'REQUIRED LINK':
JavaScript:
const regex = /(?<=const url = ').+(?=';)/gm;
// match() returns an array of matches (or null), so take the first element
const required_link = YOUR_HTML_STRING.match(regex)[0];
Python:
import re
regex = r"(?<=const url = ').+(?=';)"
required_link = re.findall(regex, HTML_STRING)[0]
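If the spacing or quote style cannot be relied on, a slightly more tolerant pattern may help. The following is only a sketch: the helper name extract_url and the pattern itself are assumptions for illustration, not part of the answer above. It accepts either quote character and returns None instead of raising when nothing matches.

```python
import re

# Assumed pattern (not from the original answer): tolerates extra whitespace
# and either single or double quotes around the value.
URL_PATTERN = re.compile(r"""const\s+url\s*=\s*(['"])(.+?)\1\s*;""")


def extract_url(html_string):
    """Return the value assigned to `const url`, or None if it is absent."""
    match = URL_PATTERN.search(html_string)
    return match.group(2) if match else None


sample = """<script>
const url = 'REQUIRED LINK';
window.location.href = url + window.location.search;
</script>"""

print(extract_url(sample))  # REQUIRED LINK
```

Using a capturing group rather than lookarounds keeps the pattern readable, and the backreference \1 makes sure the closing quote matches the opening one.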
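In XPath, @text selects an attribute named text, which the script element does not have, so your expression matches nothing. To test the element's text content, you should use: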
inside_scripts = html_tree.xpath("//script[contains(., 'url')]")
A one-liner to extract it with XPath 1.0 (the single quotes inside the Python string must be escaped):
print(html_tree.xpath('substring-after(substring-before(//script[contains(.,"const url")],"\';"),"= \'")'))
Output: REQUIRED LINK
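Whichever method extracts the link, note that the page's script then appends window.location.search (the current query string) before redirecting. A rough sketch of reproducing that step in Python follows; the function name redirect_target and the example.com URLs are hypothetical and used only for illustration.

```python
from urllib.parse import urlsplit


def redirect_target(extracted_url, page_url):
    """Mimic `url + window.location.search` from the page's script."""
    query = urlsplit(page_url).query
    # window.location.search is "?" plus the query string, or "" if there is none
    return extracted_url + ("?" + query if query else "")


# Hypothetical URLs, for illustration only
print(redirect_target("https://example.com/target",
                      "https://example.com/page?id=7"))
# https://example.com/target?id=7
```

If the page you requested had no query string, the extracted link is already the final redirect target.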