Parsing JSON from an HTML tag with Python

Question

I've used BeautifulSoup to get the snippet below from an HTML page. I'm having trouble stripping out just the JSON (after FB_DATA). I'm guessing I need to use re.search, but I'm having trouble with the regex.

The snippet is:

<script type="text/javascript">
    var FB_DATA = {
        "foo": bar,
        "two": {
          "foo": bar,
        }};
    var FB_PUSH = []; 
    var FB_PULL = []; 
</script>

Answer 1

I'm assuming your main issue is using .*? when . matches anything except newlines. With the s (dot-matches-newline) modifier, you can accomplish this very simply:

(?s)    (?# dot-match-all modifier)
var     (?# match var literally)
\s+     (?# match 1+ whitespace)
FB_DATA (?# match FB_DATA literally)
\s*     (?# match 0+ whitespace)
=       (?# match = literally)
\s*     (?# match 0+ whitespace)
(       (?# start capture group)
 \{     (?# match { literally)
 .*?    (?# lazily match 0+ characters)
 \}     (?# match } literally)
)       (?# end capture group)
;       (?# match ; literally)
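The annotated pattern above is laid out in free-spacing style, so it only works as written when whitespace in the pattern is ignored. A minimal sketch of the same idea in Python, assuming re.VERBOSE for the layout and re.DOTALL in place of the inline (?s); the SAMPLE string and the name FB_DATA_RE are hypothetical, chosen only for illustration:

```python
import re

# Hypothetical sample input standing in for the <script> contents.
SAMPLE = 'var FB_DATA = {\n    "foo": "bar"\n};\nvar FB_PUSH = [];'

# re.VERBOSE ignores whitespace in the pattern and allows "#" comments;
# re.DOTALL lets "." match newlines as well.
FB_DATA_RE = re.compile(
    r"""
    var \s+ FB_DATA   # the variable declaration
    \s* = \s*         # the assignment, with optional whitespace
    ( \{ .*? \} )     # group 1: the braces and what lies between, lazily
    ;                 # the terminating semicolon
    """,
    re.VERBOSE | re.DOTALL,
)

match = FB_DATA_RE.search(SAMPLE)
captured = match.group(1) if match else None
print(captured)
```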


Your JSON string will be in capture group #1.

import re
m = re.search(r"(?s)var\s+FB_DATA\s*=\s*(\{.*?\});", html)
print(m.group(1))  # the text captured by group 1
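To see the effect of the dot-matches-newline flag end to end, here is a small sketch that runs the same pattern with and without it and then parses the captured text. The html string is a hypothetical stand-in for the page, with quoted values and no trailing comma so that json.loads accepts it; the snippet in the question, as posted, is not strict JSON and would be rejected.

```python
import json
import re

# Hypothetical page content; the values are valid JSON here.
html = """
<script type="text/javascript">
    var FB_DATA = {
        "foo": "bar",
        "two": {
          "foo": "baz"
        }};
    var FB_PUSH = [];
</script>
"""

pattern = r"var\s+FB_DATA\s*=\s*(\{.*?\});"

# Without DOTALL, "." cannot cross a newline, so nothing matches.
no_flag = re.search(pattern, html)

# With DOTALL (equivalent to the inline (?s) flag), the match spans lines.
m = re.search(pattern, html, re.DOTALL)

# Group 1 holds the object text; parse it into a Python dict.
data = json.loads(m.group(1)) if m else None
print(no_flag)
print(data)
```

Because the match is lazy and ends at the first "}" followed by ";", a string value that itself contains "};" would cut the capture short.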

Answer 2

Start with

FB_DATA = (\{[^;]*;)

and see in which cases it is not enough.
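One case where this pattern falls short is a string value that contains a semicolon. The sketch below shows that failure on a hypothetical input and, as an alternative that is not part of either answer, lets the standard library's json.JSONDecoder.raw_decode find the end of the object. It assumes the embedded data is valid JSON.

```python
import json
import re

# Hypothetical input: one string value contains a semicolon.
script = 'var FB_DATA = {"note": "a; b", "n": 1};\nvar FB_PUSH = [];'

# The [^;]* pattern stops at the first semicolon, inside the string.
naive = re.search(r"FB_DATA = (\{[^;]*;)", script).group(1)

# Alternative: locate where the object starts, then let the JSON parser
# decide where it ends. raw_decode ignores whatever follows the object.
start = re.search(r"FB_DATA\s*=\s*", script).end()
data, end = json.JSONDecoder().raw_decode(script, start)

print(naive)
print(data)
```

raw_decode returns the parsed object together with the index just past it, so script[end] here is the semicolon that terminates the statement.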

2 answers

Answer 1: score 5, accepted, 2014-05-27 18:51:16
Answer 2: score 0, 2014-05-27 18:51:23
