Can I use BeautifulSoup to dig into inline JavaScript?

I want to scrape a block of data from a series of pages that have the data tucked away in a JSON object inside of a script tag. I'm fairly comfortable with BeautifulSoup, but I think I might be barking up the wrong tree trying to use it to get data from JavaScript.

The structure of the pages is, roughly, this:

...
<script>
  $(document).ready(function(){
    var data = $.data(graph_selector, [
         { data: charts.createData("Stuff I want")}
    ])};
</script>

The head and body have a zillion scripts each, but there's only one var data per page. I'm not sure how I'd identify this particular <script> for BeautifulSoup except by the presence of var data.

Can I do this? Or do I need another tool?

BeautifulSoup is an HTML parser; it cannot parse JavaScript code.
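BeautifulSoup can still do the first half of the job: picking out the one <script> that mentions var data. A minimal sketch, assuming bs4 4.4 or later (where the string= filter is available) and a made-up page held in the html variable; the regular expression is just one plausible way to spot the declaration:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page: several scripts, only one of which declares `var data`.
html = """
<html><head>
<script>var unrelated = 1;</script>
<script src="jquery.js"></script>
</head><body>
<script>
  $(document).ready(function(){
    var data = $.data(graph_selector, [
         { data: charts.createData("Stuff I want")}
    ])};
</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# string= is tested against each tag's .string, so external scripts
# (which have no inline text) never match.
script = soup.find("script", string=re.compile(r"\bvar\s+data\b"))

# Raw JavaScript source of the matching element, or None if nothing matched.
script_text = script.string if script is not None else None
```

The resulting script_text is plain JavaScript source; extracting the value from it still needs one of the approaches below.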

Here are the options you have:

  1. Use a JavaScript parser like slimit:

     from bs4 import BeautifulSoup
     from slimit import ast
     from slimit.parser import Parser
     from slimit.visitors import nodevisitor

     data = """
     <script>
         var data = $.data(graph_selector, [
             { data: charts.createData("Stuff I want")}
         ]);
     </script>
     """

     # Pull the JavaScript source out of the <script> element.
     soup = BeautifulSoup(data, "html.parser")
     script = soup.find('script')

     # Parse the JavaScript into an AST.
     parser = Parser()
     tree = parser.parse(script.text)

     # Walk the AST and take the first argument of the createData(...) call.
     print(next(node.args[0].value
                for node in nodevisitor.visit(tree)
                if isinstance(node, ast.FunctionCall)
                and node.identifier.identifier.value == 'createData'))
     # prints "Stuff I want"

     Note that I had to cut the script down to get a working example, because the full version caused parsing errors. This might not work on your real script contents, so please check.

  2. Use regular expressions. This is the easiest option, but it is unreliable, so don't use it in production code unless you also control the JS code and can guarantee its format:

     import re
     from bs4 import BeautifulSoup

     data = """
     <script>
       $(document).ready(function() {
         var data = $.data(graph_selector, [{data: charts.createData("Stuff I want")}])};
     </script>
     """

     # Pull the JavaScript source out of the <script> element.
     soup = BeautifulSoup(data, "html.parser")
     script = soup.find('script')

     # Capture whatever sits between the quotes in createData("...").
     pattern = r'charts\.createData\("(.*?)"\)'
     print(re.search(pattern, script.text).group(1))
     # prints "Stuff I want"

  3. Let something execute the JavaScript code: selenium (a real browser), V8, or PyExecJS.
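A variation on option 2, sketched under a strong assumption: the argument passed to charts.createData is strict JSON (double-quoted keys and strings), as the question's mention of a JSON object suggests. In that case a regular expression only has to find where the argument starts, and the standard library's json.JSONDecoder.raw_decode can read the value from there. The script_text sample and the extract_create_data_arg helper are hypothetical, made up for illustration:

```python
import json
import re

# Stand-in for the text of the matching <script>; the object literal is
# hypothetical and deliberately valid JSON.
script_text = """
$(document).ready(function(){
  var data = $.data(graph_selector, [
    { data: charts.createData({"label": "Stuff I want", "points": [1, 2, 3]}) }
  ]);
});
"""


def extract_create_data_arg(text):
    """Return the first createData(...) argument decoded as JSON, or None."""
    # Find the position just past "charts.createData(" and any whitespace.
    match = re.search(r"charts\.createData\(\s*", text)
    if match is None:
        return None
    try:
        # raw_decode reads one JSON value starting at the given index and
        # ignores whatever follows it, so no closing-paren regex is needed.
        value, _end = json.JSONDecoder().raw_decode(text, match.end())
    except json.JSONDecodeError:
        # The argument is JavaScript but not strict JSON (e.g. unquoted keys).
        return None
    return value


payload = extract_create_data_arg(script_text)
```

If the helper returns None on real pages, the argument is probably a JavaScript literal rather than strict JSON, and a JavaScript parser or an execution engine from the list above is the safer route.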
