简体   繁体   中英

How to extract hidden tags created by javascript from source page by python

I have THIST page that has some javascript in it. You can see them by clicking on show details .

So how can I extract these data from that url source?

Using re ? What I tried in re is:

import urllib
import re
gdoc = urllib.urlopen('ThatURL').read()
scriptlis = re.findall('(?si)<script>(.*?)</script>', gdoc)
print scriptlis

But no response...

Using selenium? In this is case how?

import lxml
out=lxml.html.tostring(lxml.html.parse('ThatURL'))
.
.
.
?

When pages use scripting to generate content, it becomes hard to scrape. Instead of plain html reading, you need a full virtual environment capable of executing the script on the document.

For python, there's ghost.py . It's pretty flexible, and will allow you to inspect the fully rendered website, as well as to execute your own javascript to interact with the page.

ghost.py is a python clone of phantom.js , a node library. This second tool is superior, in my opinion, but it's not written for python.

你可以试试这个

re.findall('<script.*>.*</script>',url_file)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM