How to extract hidden tags created by javascript from source page by python

Question

I have THIST page that has some javascript in it. You can see them by clicking on show details .

So how can I extract these data from that url source?

Using re ? What I tried in re is:

import urllib
import re
gdoc = urllib.urlopen('ThatURL').read()
scriptlis = re.findall('(?si)<script>(.*?)</script>', gdoc)
print scriptlis

But no response...

Using selenium? In this is case how?

import lxml
out=lxml.html.tostring(lxml.html.parse('ThatURL'))
.
.
.
?

Answer 1

When pages use scripting to generate content, it becomes hard to scrape. Instead of plain html reading, you need a full virtual environment capable of executing the script on the document.

For python, there's ghost.py . It's pretty flexible, and will allow you to inspect the fully rendered website, as well as to execute your own javascript to interact with the page.

ghost.py is a python clone of phantom.js , a node library. This second tool is superior, in my opinion, but it's not written for python.

Answer 2

你可以试试这个

re.findall('<script.*>.*</script>',url_file)

How to extract hidden tags created by javascript from source page by python

Question

2 answers

solution1
2 2014-09-29 13:59:46

solution2
0 2014-09-29 16:03:47

How to extract hidden tags created by javascript from source page by python

Question

2 answers

solution1 2 2014-09-29 13:59:46

solution2 0 2014-09-29 16:03:47

solution1
2 2014-09-29 13:59:46

solution2
0 2014-09-29 16:03:47