I have THIS page that has some JavaScript in it. You can see the data by clicking on "show details".
So how can I extract this data from that URL's source? Using re? What I tried with re is:
import urllib
import re
gdoc = urllib.urlopen('ThatURL').read()
scriptlis = re.findall('(?si)<script>(.*?)</script>', gdoc)
print scriptlis
But no response...
Using selenium? In that case, how?
import lxml.html
out = lxml.html.tostring(lxml.html.parse('ThatURL'))
...?
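When a page uses scripting to generate its content, it becomes hard to scrape: the data is not in the HTML the server sends, it only appears after the scripts run. Instead of just reading the plain HTML, you need a full browser-like environment capable of executing the scripts on the document.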
For Python, there's ghost.py. It's pretty flexible, and will allow you to inspect the fully rendered website, as well as to execute your own JavaScript to interact with the page.
ghost.py is a Python counterpart to PhantomJS, a headless WebKit browser that is scripted in JavaScript. This second tool is superior, in my opinion, but it's not written for Python.
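Since the question also asks about selenium, here is a minimal sketch of the same idea with that library instead of ghost.py: let a real browser run the page's scripts, then read the resulting document. It assumes Python 3, the `selenium` package, and Firefox with its driver installed. `'ThatURL'` is the placeholder from the question, and `fetch_rendered_html` and `inline_scripts` are hypothetical helper names, not part of any library.

```python
import re


def fetch_rendered_html(url):
    """Load url in a real browser and return the page source after scripts ran."""
    # Imported lazily so inline_scripts() stays usable without selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails


def inline_scripts(html):
    """Return the bodies of all <script> elements, with or without attributes."""
    # [^>]* allows attributes such as type="text/javascript";
    # (?is) makes the match case-insensitive and lets . span newlines.
    return re.findall(r"(?is)<script\b[^>]*>(.*?)</script>", html)


# Usage (needs a browser, so it is left commented out):
# html = fetch_rendered_html("ThatURL")
# for body in inline_scripts(html):
#     print(body)
```

If the data only shows up after clicking "show details", the click would have to be performed through the driver before reading `page_source`; how to locate that element depends on the page's markup, which is not shown here.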
You can try this:
re.findall(r'(?is)<script[^>]*>.*?</script>', url_file)
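A regular expression over HTML is easy to get subtly wrong (attributes on the tag, scripts spanning several lines, mixed case). As a hedged alternative, the sketch below does the same extraction with the `html.parser` module from the Python 3 standard library. `ScriptCollector` and `extract_scripts` are hypothetical names, and `url_file` is assumed to hold the page source as a string. Like the regex, it only sees scripts present in the HTML the server sent, not content generated later in the browser.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text inside every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []        # finished script bodies, in document order
        self._in_script = False  # True while between <script> and </script>
        self._chunks = []        # pieces of the script currently being read

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <SCRIPT> is matched as well.
        if tag == "script":
            self._in_script = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.scripts.append("".join(self._chunks))
            self._in_script = False


def extract_scripts(url_file):
    """Return a list with the body of each <script> element in url_file."""
    parser = ScriptCollector()
    parser.feed(url_file)
    parser.close()
    return parser.scripts
```

External scripts (`<script src="...">`) come back as empty strings, since their code lives in a separate file.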