How to extract a specific script element from HTML using Beautiful Soup

Question

I am using BS4 to extract info from a football stats page. Here is how I have started:

from bs4 import BeautifulSoup as bs
import requests

res = requests.get(url)
soup = bs(res.content, 'lxml')
scripts = soup.find_all('script')
scripts = [script for script in scripts]

This successfully returns all script elements as a list.

I need to extract one specific script element, which begins as follows:

 <script>
    var teamsData = JSON.parse('\x7B\x2271\x22\x3A\x7B\x22id\x22\x3A\x2271\x22,\x22title\x22\x3A\x22Aston\x20Villa\x22,\x22history\x22\x3A\x5B\x5D\x7D,\x2272\x22\x3A\x7B\x22id\x22\x3A\x2272\x22...
</script>

I have tried various iterations of the following code, but the output always prints as blank:

for script in scripts: 
    if 'teamsData' in script.text: 
        print(script)

I could always resort to simply using 'print(scripts[2])', but I wanted to know why my initial efforts failed.

Thanks!

Answer 1

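With the Beautiful Soup version I tested, .text is an empty string for script tags. Since 4.9.0, the contents of <script>, <style>, and <template> tags are not treated as text when lxml or html.parser is the parser (newer releases may behave differently when .text is called on the script tag itself). You can, however, get the contents of the tag from .children: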

from bs4 import BeautifulSoup
from io import StringIO

html = """
<script>
let a = "Hello";
</script>
"""
b = StringIO(html)
soup = BeautifulSoup(b, 'lxml')

for e in soup.find_all('script'):
    print(repr(e.text))
    print(repr(''.join(e.children)))
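Applied to the loop in the question, the same idea gives a filter that does not depend on what .text returns. This is only a minimal sketch: the HTML below is made up for illustration, html.parser is used so that lxml is not required, and .string is used as shorthand for a tag's single child string (it is None for a script tag with no inline content, hence the guard).

```python
import re
from bs4 import BeautifulSoup

# Made-up stand-in for the real page.
html_doc = """
<html><head>
<script>var other = 1;</script>
<script src="app.js"></script>
<script>var teamsData = JSON.parse('{}');</script>
</head></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Option 1: filter the list yourself. .string is None for the empty
# <script src=...> tag, so check it before using `in`.
matches = [s for s in soup.find_all('script')
           if s.string and 'teamsData' in s.string]

# Option 2: let find() match the tag's string against a regex.
target = soup.find('script', string=re.compile(r'teamsData'))

print(target)
```

Either option avoids relying on the position of the tag (scripts[2]), which can change if the page adds or removes a script.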

Answer 2

You can use .string to access the <script> string:

import re
import json
from bs4 import BeautifulSoup


html_doc = '''<script>
    var teamsData = JSON.parse('\x7B\x2271\x22\x3A\x7B\x22id\x22\x3A\x2271\x22,\x22title\x22\x3A\x22Aston\x20Villa\x22,\x22history\x22\x3A\x5B\x5D\x7D,\x2272\x22\x3A\x7B\x22id\x22\x3A\x2272\x22\x7D\x7D');
</script>'''

soup = BeautifulSoup(html_doc, 'html.parser')

script_string = soup.find('script').string
print(script_string)

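Prints (the \x escapes already appear decoded because html_doc is a regular, non-raw Python string literal, so Python itself interprets them):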

var teamsData = JSON.parse('{"71":{"id":"71","title":"Aston Villa","history":[]},"72":{"id":"72"}}');

To parse the JSON data, you can use the re and json modules. For example:

data = re.search(r"JSON\.parse\('(.*?)'\);", script_string).group(1)
data = json.loads(data)

for k, v in data.items():
    print(k, v)

Prints:

71 {'id': '71', 'title': 'Aston Villa', 'history': []}
72 {'id': '72'}
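One caveat when running this against a fetched page: there the \x7B-style sequences arrive as literal backslash text, so json.loads would probably reject the captured string until the escapes are decoded. A minimal sketch of one way to do that, assuming the payload only contains ASCII characters plus \xNN escapes (the unicode_escape codec can mangle non-ASCII text); the raw string below is a shortened stand-in for the real script content.

```python
import json
import re

# Raw string: the backslash escapes stay literal, as they would in
# text taken from a downloaded page. Shortened, illustrative payload.
script_string = r"var teamsData = JSON.parse('\x7B\x2271\x22\x3A\x7B\x22id\x22\x3A\x2271\x22,\x22title\x22\x3A\x22Aston\x20Villa\x22\x7D\x7D');"

# Capture the argument of JSON.parse('...').
match = re.search(r"JSON\.parse\('(.*?)'\);", script_string)
escaped = match.group(1)

# Turn the literal \xNN sequences into the characters they stand for.
decoded = escaped.encode('utf-8').decode('unicode_escape')

teams = json.loads(decoded)
print(teams)
```

If the real data turns out to contain non-ASCII characters, a different decoding step would be needed; this sketch does not cover that case.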

2 answers

solution1
0 ACCEPTED 2020-09-15 17:24:50

solution2
0 2020-09-15 17:33:31