
How to extract a specific script element from HTML using Beautiful Soup

I am using BS4 to extract info from a football stats page. Here is how I have started:

from bs4 import BeautifulSoup as bs
import requests

res = requests.get(url)
soup = bs(res.content, 'lxml')
scripts = soup.find_all('script')
scripts = [script for script in scripts]

This successfully returns all script elements as a list.

I need to extract one specific script element, which begins as follows:

<script>
    var teamsData = JSON.parse('\x7B\x2271\x22\x3A\x7B\x22id\x22\x3A\x2271\x22,\x22title\x22\x3A\x22Aston\x20Villa\x22,\x22history\x22\x3A\x5B\x5D\x7D,\x2272\x22\x3A\x7B\x22id\x22\x3A\x2272\x22...
</script>

I have tried various iterations of the following code, but nothing is ever printed:

for script in scripts: 
    if 'teamsData' in script.text: 
        print(script)

I could always resort to simply using 'print(scripts[2])', but I wanted to know why my initial efforts failed.

Thanks!

Apparently, in some Beautiful Soup versions .text returns an empty string for <script> tags, which is why the 'teamsData' check never matches. You can, however, get the contents of the tag from .children:

from bs4 import BeautifulSoup
from io import StringIO

html = """
<script>
let a = "Hello";
</script>
"""
b = StringIO(html)
soup = BeautifulSoup(b, 'lxml')

for e in soup.find_all('script'):
    print(repr(e.text))
    print(repr(''.join(e.children)))
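
Applied to the original question, the same idea can be used to pick out one script by a variable name it contains. The following is only a minimal sketch: the sample HTML and the helper name find_script_containing are made up for illustration, and it reads the tag contents through .string (which is None for a script tag with no inline code) rather than .text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; a real stats page would contain many more tags.
html = """
<html><body>
<script src="app.js"></script>
<script>var playersData = [];</script>
<script>var teamsData = JSON.parse('{}');</script>
</body></html>
"""


def find_script_containing(soup, needle):
    """Return the first <script> tag whose inline code contains needle, else None."""
    for script in soup.find_all('script'):
        # .string is None for tags without inline code (e.g. external scripts),
        # so fall back to an empty string before testing membership.
        code = script.string or ''
        if needle in code:
            return script
    return None


soup = BeautifulSoup(html, 'html.parser')
match = find_script_containing(soup, 'teamsData')
print(match.string)
```

Searching by content instead of by position (scripts[2]) should keep working if the page adds or reorders script tags.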

You can use .string to access the <script> string:

import re
import json
from bs4 import BeautifulSoup


html_doc = '''<script>
    var teamsData = JSON.parse('\x7B\x2271\x22\x3A\x7B\x22id\x22\x3A\x2271\x22,\x22title\x22\x3A\x22Aston\x20Villa\x22,\x22history\x22\x3A\x5B\x5D\x7D,\x2272\x22\x3A\x7B\x22id\x22\x3A\x2272\x22\x7D\x7D');
</script>'''

soup = BeautifulSoup(html_doc, 'html.parser')

script_string = soup.find('script').string
print(script_string)

Prints:

var teamsData = JSON.parse('{"71":{"id":"71","title":"Aston Villa","history":[]},"72":{"id":"72"}}');

To parse the JSON data, you can use the re and json modules. For example:

data = re.search(r"JSON\.parse\('(.*?)'\);", script_string).group(1)
data = json.loads(data)

for k, v in data.items():
    print(k, v)

Prints:

71 {'id': '71', 'title': 'Aston Villa', 'history': []}
72 {'id': '72'}
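
One caveat: in the snippet above, html_doc is an ordinary (non-raw) Python string literal, so Python itself converts \x7B to { before Beautiful Soup ever sees it. In a page fetched with requests, the escapes would presumably arrive as literal backslash sequences, and json.loads would reject them. Below is a rough sketch of one way to handle that using only the standard library. The helper name decode_hex_escapes is hypothetical, and it assumes the payload uses only two-digit \xNN escapes.

```python
import json
import re

# Raw string, so the backslashes stay literal, as they would in downloaded HTML.
script_string = r"""
    var teamsData = JSON.parse('\x7B\x2271\x22\x3A\x7B\x22id\x22\x3A\x2271\x22,\x22title\x22\x3A\x22Aston\x20Villa\x22,\x22history\x22\x3A\x5B\x5D\x7D,\x2272\x22\x3A\x7B\x22id\x22\x3A\x2272\x22\x7D\x7D');
"""


def decode_hex_escapes(text):
    """Replace each literal \\xNN sequence with the character it encodes."""
    return re.sub(
        r'\\x([0-9A-Fa-f]{2})',
        lambda m: chr(int(m.group(1), 16)),
        text,
    )


# Extract the still-escaped payload first, then decode it, then parse the JSON.
payload = re.search(r"JSON\.parse\('(.*?)'\)", script_string).group(1)
data = json.loads(decode_hex_escapes(payload))

for team_id, team in data.items():
    print(team_id, team)
```

Extracting before decoding keeps the regex simple, because any quote characters inside the payload are still escaped at that point and cannot end the match early.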

