I want to scrape a block of data from a series of pages that have the data tucked away in a JSON object inside of a script tag. I'm fairly comfortable with BeautifulSoup, but I think I might be barking up the wrong tree trying to use it to get data from JavaScript.
The structure of the pages is, roughly, this:
...
<script>
$(document).ready(function(){
var data = $.data(graph_selector, [
{ data: charts.createData("Stuff I want")}
])};
</script>
The head and body have a zillion scripts each, but there's only one var data
per page. I'm not sure how I'd identify this particular <script>
for BeautifulSoup except by the presence of var data
Can I do this? Or do I need another tool?
BeautifulSoup
is an HTML parser, it cannot parse javascript code.
Here are the options you have:
use a javascript parser like slimit
from bs4 import BeautifulSoup from slimit import ast from slimit.parser import Parser from slimit.visitors import nodevisitor data = """ <script> var data = $.data(graph_selector, [ { data: charts.createData("Stuff I want")} ]); </script> """ soup = BeautifulSoup(data) script = soup.find('script') parser = Parser() tree = parser.parse(script.text) print next(node.args[0].value for node in nodevisitor.visit(tree) if isinstance(node, ast.FunctionCall) and node.identifier.identifier.value == 'createData') # prints "Stuff I want"
Note that I had to cut down the script for the sake of a working example and due to parsing errors. Might not work for your real script contents, please check.
use regular expressions (the easiest option yet unreliable so don't use it in production code unless you have control over the JS code too and can make the guarantees needed):
import re from bs4 import BeautifulSoup data = """ <script> $(document).ready(function() { var data = $.data(graph_selector, [{data: charts.createData("Stuff I want")}])}; </script> """ soup = BeautifulSoup(data) script = soup.find('script') pattern = r'charts.createData\\("(.*?)"\\)' print re.search(pattern, script.text).group(1) # prints "Stuff I want"
let smth execute the javascript code: selenium
(real browser), or V8
, or PyExecJS
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.