
Extract data from <script> with BeautifulSoup

I'm trying to scrape some data with Python and BeautifulSoup. I know how to get the text from the script tag. The data between [ ] is valid JSON.

<script>
    dataLayer = 
[  
  {  
  "p":{  
         "t":"text1",
         "lng":"text2",
         "vurl":"text3"
       },
  "c":{  },
  "u":{  },
  "d":{  },
  "a":{  }
  }
]
</script>

I've read this response and it almost does what I want: Extract content of <Script with BeautifulSoup

Here is my code:

import urllib.request
from bs4 import BeautifulSoup
import json

url = "http://www.example.com"  # urlopen needs a full URL, including the scheme
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
raw_data = soup.find("script")

I would then ideally do:

json_dict = json.loads(raw_data)

And access the data through the dictionary. But this does not work because of

"<script> dataLayer =" 

preceding the valid JSON, and the closing script tag at the end. I've tried trimming raw_data as a string, like this:

raw_data[20:]

But this didn't work because raw_data is a BeautifulSoup Tag object, not a string.

How can I get the raw_data variable to contain ONLY the text between the square brackets [ ]?

EDIT: this seems to work. It avoids regex and solves the problem of the trailing chars as well. Thanks for your suggestions.

url = "http://www.example.com"  # urlopen needs a full URL, including the scheme
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, "html.parser")

# get the script tag data and convert soup into a string
data = str(soup.find("script"))

# cut the <script> tag and some other things from the beginning and end to get valid JSON
cut = data[27:-13]

# load the data as a json dictionary
jsoned = json.loads(cut)
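
The hard-coded offsets 27 and -13 only hold for this exact markup, so a slightly different page would break the slice. A minimal sketch of the same slicing idea that looks up the bracket positions instead; the data string here is a stand-in for str(soup.find("script")), built from the sample in the question:

```python
import json

# Stand-in for str(soup.find("script")); the markup is the sample from the question.
data = """<script>
    dataLayer = 
[  
  {  
  "p":{  
         "t":"text1",
         "lng":"text2",
         "vurl":"text3"
       },
  "c":{  },
  "u":{  },
  "d":{  },
  "a":{  }
  }
]
</script>"""

# Find the first "[" and the last "]" rather than counting characters by hand.
start = data.index("[")
end = data.rindex("]") + 1

# Everything between the two positions is the JSON array.
jsoned = json.loads(data[start:end])

print(jsoned[0]["p"]["t"])
```

This still assumes the script contains a single top-level array and no other square brackets before or after it.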

>>> import re
>>> re.search(r"\[.*\]", soup.find("script").string, re.DOTALL).group(0)

You would do that with a regex.

You will need a pattern that matches only the text between [ ].

Here is a link on common regex usage within BeautifulSoup.

Here is the regex to extract the text between square brackets.
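
A rough sketch of the regex route end to end, assuming the script body has already been pulled out as a string (for example with soup.find("script").string). The script_text value below is a stand-in copied from the question, and the pattern is one possible choice, not the only one:

```python
import json
import re

# Stand-in for soup.find("script").string, using the sample from the question.
script_text = """
    dataLayer = 
[  
  {  
  "p":{  
         "t":"text1",
         "lng":"text2",
         "vurl":"text3"
       },
  "c":{  },
  "u":{  },
  "d":{  },
  "a":{  }
  }
]
"""

# Greedy match from the first "[" to the last "]".
# re.DOTALL lets "." match newlines, since the JSON spans several lines.
match = re.search(r"\[.*\]", script_text, re.DOTALL)
if match is None:
    raise ValueError("no [...] block found in the script text")

json_dict = json.loads(match.group(0))
print(json_dict[0]["p"]["vurl"])
```

The greedy .* is deliberate: a non-greedy .*? would stop at the first ], which cuts the data short as soon as the JSON contains a nested array.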

Use .text to get the content inside the <script> tag, then strip out dataLayer =

raw_data = soup.find("script")
raw_data = raw_data.text.replace('dataLayer =', '')
json_dict = json.loads(raw_data)
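
The replace approach works as long as nothing follows the closing bracket. If the script ends with extra characters, json.loads will reject the string. One way around that, sketched here with the standard library's json.JSONDecoder.raw_decode, is to start parsing at the first [ and let the decoder stop where the JSON ends. The script_text value and its trailing semicolon are hypothetical, not taken from the question's page:

```python
import json

# Hypothetical stand-in for soup.find("script").string.
# The trailing ";" is an assumption, added to show the trailing-character case.
script_text = 'dataLayer = [{"p": {"t": "text1", "lng": "text2", "vurl": "text3"}, "c": {}}];\n'

decoder = json.JSONDecoder()

# raw_decode does not skip leading text, so point it at the first "[".
start = script_text.index("[")

# Returns the parsed value and the index just past the end of the JSON document.
json_data, end = decoder.raw_decode(script_text, start)

print(json_data[0]["p"]["lng"])
print(repr(script_text[end:]))  # whatever was left after the JSON
```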
