
Extract Text from Javascript using Python

I've been looking at examples of how to do this but can't quite figure it out. I'm using BeautifulSoup to scrape some data. I can use it to find the data I want, but that data is contained in the following block of code, and I'm trying to extract the timestamp information from it. I have a feeling regular expressions would work here, but I can't seem to figure it out. Any suggestions?

    <script class="code" type="text/javascript">
    $(document).ready(function(){
    line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898], ['2009-02-09 10 AM', 4923], ['2009-02-22 10 AM', 3682], ['2009-02-21 10 AM', 3238], ['2009-02-20 10 AM', 4648]];
    options1 = {
    etc other text
      }
    });
    </script>

You can't use BS to extract this data: BS parses only HTML/XML, not the JavaScript inside a script tag.

You have to use regular expressions or standard string functions.
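For the plain string-function route, here is a minimal sketch. It assumes the array is assigned with the exact text `line1 = ` and that the first `;` after it ends the assignment; the shortened `text` sample is only for illustration.

```python
text = """<script class="code" type="text/javascript">
$(document).ready(function(){
line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898]];
options1 = {};
});
</script>"""

# Keep what follows "line1 = ", then cut at the first ";" (assumed to end the assignment).
_, _, rest = text.partition("line1 = ")
array_source, _, _ = rest.partition(";")

# After splitting on single quotes, the quoted pieces sit at the odd positions.
timestamps = array_source.split("'")[1::2]
print(timestamps)
```

This breaks as soon as a quoted value itself contains a `'` or a `;`, so it only suits data as regular as the sample.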


EDIT:

text = '''<script class="code" type="text/javascript">
    $(document).ready(function(){
    line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898], ['2009-02-09 10 AM', 4923], ['2009-02-22 10 AM', 3682], ['2009-02-21 10 AM', 3238], ['2009-02-20 10 AM', 4648]];
    options1 = {
    etc other text
      }
    });
    </script>'''

import re

re.findall("'([^']*)'", text)

result:

['2009-02-23 10 AM',
 '2009-02-08 10 AM',
 '2009-02-09 10 AM',
 '2009-02-22 10 AM',
 '2009-02-21 10 AM',
 '2009-02-20 10 AM']
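The pattern above matches every single-quoted string in the page, so it would also pick up unrelated quoted text if the script contained any. A possible refinement, sketched under the assumption that the data always sits in a `line1 = [...];` assignment and that the timestamps always look like `2009-02-23 10 AM`, is to narrow the search to that assignment first and then convert the strings to `datetime` objects:

```python
import re
from datetime import datetime

text = """<script class="code" type="text/javascript">
$(document).ready(function(){
line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898]];
options1 = {};
});
</script>"""

# Capture only the array literal assigned to line1 (non-greedy, up to the first "];").
match = re.search(r"line1\s*=\s*(\[.*?\]);", text, re.DOTALL)
array_source = match.group(1) if match else ""

# Pull the single-quoted strings out of that array only.
stamps = re.findall(r"'([^']*)'", array_source)

# "%I %p" reads the 12-hour clock value and the AM/PM marker.
timestamps = [datetime.strptime(s, "%Y-%m-%d %I %p") for s in stamps]
print(timestamps)
```

Note that `%p` is locale-dependent, so the format string assumes an English or C locale.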

Another alternative to using regular expressions to parse the JavaScript code is to use a JavaScript parser such as slimit. Working code:

import json

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

data = """<script class="code" type="text/javascript">
$(document).ready(function(){
line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898], ['2009-02-09 10 AM', 4923], ['2009-02-22 10 AM', 3682], ['2009-02-21 10 AM', 3238], ['2009-02-20 10 AM', 4648]];
options1 = {};
});
</script>"""

soup = BeautifulSoup(data, "html.parser")
parser = Parser()
tree = parser.parse(soup.script.get_text())

for node in nodevisitor.visit(tree):
    if isinstance(node, ast.Assign) and getattr(node.left, 'value', '') == 'line1':
        values = json.loads(node.right.to_ecma().replace("'", '"').strip())
        print(values)
        break

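Prints a Python list (the output below is from Python 2, hence the u prefixes; Python 3 prints the strings without them):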

[[u'2009-02-23 10 AM', 5203], [u'2009-02-08 10 AM', 3898], [u'2009-02-09 10 AM', 4923], [u'2009-02-22 10 AM', 3682], [u'2009-02-21 10 AM', 3238], [u'2009-02-20 10 AM', 4648]]
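Swapping the quotes before calling `json.loads` works for this data but would fail if a string contained an apostrophe. One possible alternative, sketched here without slimit, is `ast.literal_eval`. It assumes the JavaScript array literal also happens to be a valid Python literal, which holds for nested lists of strings and numbers but not for JavaScript's `true`, `false` or `null`. The hard-coded `array_source` string stands in for the source text of the array.

```python
import ast
import json

# Stand-in for the source text of the JavaScript array literal.
array_source = "[['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898]]"

# Parse the text as a Python literal; no code is executed.
values = ast.literal_eval(array_source)
print(values)

# For this input, the quote-swapping route gives the same list.
same = json.loads(array_source.replace("'", '"'))
print(values == same)
```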
