I've been looking at examples of how to do this but can't quite figure it out. I'm using beautifulsoup to scrape some data - I am able to use it to find the data I want, but it is contained in the following block of code. I'm trying to extract the timestamp information from it. I have a feeling regular expressions work here but I can't seem to figure it out - any suggestions??
<script class="code" type="text/javascript">
$(document).ready(function(){
line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898], ['2009-02-09 10 AM', 4923], ['2009-02-22 10 AM', 3682], ['2009-02-21 10 AM', 3238], ['2009-02-20 10 AM', 4648]];
options1 = {
etc other text
}
});
</script>
You can't use BS to get this data - BS works only with HTML/XML, not JavaScript.
You have to use regular expressions
or standart string functions.
EDIT:
text = '''<script class="code" type="text/javascript">
$(document).ready(function(){
line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898], ['2009-02-09 10 AM', 4923], ['2009-02-22 10 AM', 3682], ['2009-02-21 10 AM', 3238], ['2009-02-20 10 AM', 4648]];
options1 = {
etc other text
}
});
</script>'''
import re
re.findall("'([^']*)'", text)
result:
['2009-02-23 10 AM',
'2009-02-08 10 AM',
'2009-02-09 10 AM',
'2009-02-22 10 AM',
'2009-02-21 10 AM',
'2009-02-20 10 AM']
One another alternative to using regular expressions to parse javascript code would be to use a JavaScript parser like slimit
. Working code:
import json
from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor
data = """<script class="code" type="text/javascript">
$(document).ready(function(){
line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898], ['2009-02-09 10 AM', 4923], ['2009-02-22 10 AM', 3682], ['2009-02-21 10 AM', 3238], ['2009-02-20 10 AM', 4648]];
options1 = {};
});
</script>"""
soup = BeautifulSoup(data, "html.parser")
parser = Parser()
tree = parser.parse(soup.script.get_text())
for node in nodevisitor.visit(tree):
if isinstance(node, ast.Assign) and getattr(node.left, 'value', '') == 'line1':
values = json.loads(node.right.to_ecma().replace("'", '"').strip())
print(values)
break
Prints a Python list:
[[u'2009-02-23 10 AM', 5203], [u'2009-02-08 10 AM', 3898], [u'2009-02-09 10 AM', 4923], [u'2009-02-22 10 AM', 3682], [u'2009-02-21 10 AM', 3238], [u'2009-02-20 10 AM', 4648]]
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.