
Parse JSON from an HTML tag using Python

I've used BeautifulSoup to get the snippet below from an HTML page. I'm having trouble extracting just the JSON (the value assigned to FB_DATA). I'm guessing I need to use re.search, but I'm having trouble with the regex.

The snippet is:

<script type="text/javascript">
    var FB_DATA = {
        "foo": bar,
        "two": {
          "foo": bar,
        }};
    var FB_PUSH = []; 
    var FB_PULL = []; 
</script>

I'm assuming your main issue is using .*? when, by default, . matches anything except newlines. With the s (dot-matches-newline) modifier, you can accomplish this very simply:

(?s)    (?# dot-match-all modifier)
var     (?# match var literally)
\s+     (?# match 1+ whitespace)
FB_DATA (?# match FB_DATA literally)
\s*     (?# match 0+ whitespace)
=       (?# match = literally)
\s*     (?# match 0+ whitespace)
(       (?# start capture group)
 \{     (?# match { literally)
 .*?    (?# lazily match 0+ characters)
 \}     (?# match } literally)
)       (?# end capture group)
;       (?# match ; literally)


Your JSON string will be in capture group #1.

import re
m = re.search(r"(?s)var\s+FB_DATA\s*=\s*(\{.*?\});", html)
if m:  # re.search returns None when the pattern is not found
    print(m.group(1))
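
To go from the captured text to a Python object, something like the following should work. It is only a minimal sketch, assuming the real page holds valid JSON (the "foo": bar placeholders in the question are not valid, so the sample below quotes the values and drops the trailing comma). The html sample string and the helper name extract_fb_data are made up for illustration; re.DOTALL is the flag equivalent of the inline (?s).

```python
import json
import re

# Hypothetical page content, adapted from the question so the values are valid JSON.
html = """<script type="text/javascript">
    var FB_DATA = {
        "foo": "bar",
        "two": {
          "foo": "bar"
        }};
    var FB_PUSH = [];
    var FB_PULL = [];
</script>"""

# re.DOTALL lets "." match newlines, same effect as the inline (?s) modifier.
FB_DATA_RE = re.compile(r"var\s+FB_DATA\s*=\s*(\{.*?\});", re.DOTALL)


def extract_fb_data(text):
    """Return the FB_DATA object as a dict, or None if the pattern is not found."""
    m = FB_DATA_RE.search(text)
    if m is None:
        return None
    # Group 1 holds everything from the opening "{" to the "}" just before ";".
    return json.loads(m.group(1))


data = extract_fb_data(html)
print(data["two"]["foo"])  # prints: bar
```

Keep in mind that the lazy .*? stops at the first "};" it sees, so a "};" inside a string value would cut the match short and json.loads would raise an error.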

Start with

FB_DATA = (\{[^;]*;)

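and see in which cases it's not enough. Note that this group also captures the trailing semicolon, which has to be stripped before the text is passed to a JSON parser.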
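
One case where it is not enough is a semicolon inside a string value. The sketch below shows that failure and one possible alternative: use a regex only to find where the object starts, then let json.JSONDecoder.raw_decode from the standard library find where it ends. The js sample string and the helper name decode_after are assumptions for illustration, and this still requires the embedded object to be valid JSON.

```python
import json
import re

# Hypothetical input with a semicolon inside a string value.
js = 'var FB_DATA = {"note": "a; b", "n": 1};\nvar FB_PUSH = [];'

# The simple pattern stops at the first semicolon, which is inside the string.
simple = re.search(r"FB_DATA = (\{[^;]*;)", js)
print(simple.group(1))  # prints: {"note": "a;


def decode_after(text, marker="FB_DATA"):
    """Parse the JSON object assigned to `var <marker>`, or return None if absent."""
    m = re.search(r"var\s+" + re.escape(marker) + r"\s*=\s*", text)
    if m is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here, the ";" and the rest of the script).
    obj, _end = json.JSONDecoder().raw_decode(text, m.end())
    return obj


print(decode_after(js))  # prints: {'note': 'a; b', 'n': 1}
```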
