
Regex to match nested JSON objects

I'm implementing a parser of sorts, and I need to locate and deserialize JSON objects embedded in other semi-structured data. I used this regexp:

\\{\\s*title.*?\\}

to locate an object such as

{title:'Title'}

but it doesn't work with nested objects, because the expression stops at the first closing curly bracket it finds. For

{title:'Title',{data:'Data'}}

it matches

{title:'Title',{data:'Data'}

so the string is no longer valid for deserialization. I understand that greediness comes into play here, but I'm not familiar with regexps. Could you please help me extend the expression so that it consumes all available closing curly brackets?

Update:

To be clear, this is an attempt to extract JSON from semi-structured data such as HTML+JS with embedded JSON. I'm using the Gson Java library to actually parse the extracted JSON.

As others have suggested, a full-blown JSON parser is probably the way to go. If you only want to match the key-value pairs in the simple examples above, you could use:

(?<=\{)\s*[^{]*?(?=[\},])

For the input string

{title:'Title',  {data:'Data', {foo: 'Bar'}}}

this matches:

 1. title:'Title'
 2. data:'Data'
 3. foo: 'Bar'
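
As a rough illustration, the lookaround pattern above can be run from Java with java.util.regex. The class name KeyValueMatcher and the method findPairs are made up for this sketch; only the pattern and the input string come from the answer. It pulls out key-value pairs, not whole objects, so it does not solve the nesting problem on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueMatcher {
    // The lookaround pattern from the answer, escaped for a Java string literal.
    private static final Pattern PAIR =
            Pattern.compile("(?<=\\{)\\s*[^{]*?(?=[\\},])");

    // Collects every match of the pattern in the input, in order.
    static List<String> findPairs(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            pairs.add(m.group());
        }
        return pairs;
    }

    public static void main(String[] args) {
        String input = "{title:'Title',  {data:'Data', {foo: 'Bar'}}}";
        // Prints one pair per line.
        findPairs(input).forEach(System.out::println);
    }
}
```

Because the lookbehind requires an opening brace directly before the match, only the first pair after each { is found; a second pair in the same object would be skipped.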

This recursive Perl/PCRE regular expression should be able to match any valid JSON or JSON5 object, including nested objects and edge cases such as braces inside JSON strings or JSON5 comments:

/(\{(?:(?>[^{}"'\/]+)|(?>"(?:(?>[^\\"]+)|\\.)*")|(?>'(?:(?>[^\\']+)|\\.)*')|(?>\/\/.*\n)|(?>\/\*.*?\*\/)|(?-1))*\})/

Of course, that's a bit hard to read, so you might prefer the commented version:

m{
  (                               # Begin capture group (matching a JSON object).
    \{                              # Match opening brace for JSON object.
    (?:                             # Begin non-capturing group to contain alternations.
      (?>[^{}"'\/]+)                  # Match a non-empty string which contains no braces, quotes or slashes, without backtracking.
    |                               # Alternation; next alternative follows.
      (?>"(?:(?>[^\\"]+)|\\.)*")      # Match a double-quoted JSON string, without backtracking.
    |                               # Alternation; next alternative follows.
      (?>'(?:(?>[^\\']+)|\\.)*')      # Match a single-quoted JSON5 string, without backtracking.
    |                               # Alternation; next alternative follows.
      (?>\/\/.*\n)                    # Match a single-line JSON5 comment, without backtracking.
    |                               # Alternation; next alternative follows.
      (?>\/\*.*?\*\/)                 # Match a multi-line JSON5 comment, without backtracking.
    |                               # Alternation; next alternative follows.
      (?-1)                           # Recurse to most recent capture group, to match a nested JSON object.
    )*                              # End of non-capturing group; match zero or more repetitions of this group.
    \}                              # Match closing brace for JSON object.
  )                               # End of capture group (matching a JSON object).
}x
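
Since the question uses Java, note that java.util.regex does not support recursive constructs such as (?-1), so the pattern above cannot be used there directly. One possible workaround is a small hand-written scanner that follows the same alternatives (quoted strings, // comments, /* */ comments, nested braces). The sketch below is an approximation, not a port of the regex: the names JsonObjectScanner, findObjectEnd and skipString are invented here, and unlike the regex it tolerates a stray / that does not start a comment.

```java
class JsonObjectScanner {
    /**
     * Returns the index just past the brace that closes the object
     * opening at 'start', or -1 if no balanced object begins there.
     */
    static int findObjectEnd(String s, int start) {
        if (start < 0 || start >= s.length() || s.charAt(start) != '{') {
            return -1;
        }
        int depth = 0;
        int i = start;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '"' || c == '\'') {
                // Skip a quoted string so braces inside it are ignored.
                i = skipString(s, i, c);
                if (i < 0) return -1;
                continue;
            }
            if (c == '/' && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next == '/') {
                    // Single-line comment: skip through the newline.
                    int newline = s.indexOf('\n', i);
                    if (newline < 0) return -1;
                    i = newline + 1;
                    continue;
                }
                if (next == '*') {
                    // Block comment: skip through the closing */.
                    int close = s.indexOf("*/", i + 2);
                    if (close < 0) return -1;
                    i = close + 2;
                    continue;
                }
            }
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return i + 1;
            }
            i++;
        }
        return -1; // Ran out of input before the braces balanced.
    }

    // Returns the index just past the closing quote, or -1 if unterminated.
    static int skipString(String s, int open, char quote) {
        for (int i = open + 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++; // Skip the escaped character.
            } else if (c == quote) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String s = "x = {a:\"}\", b:{c:'{'} /* } */ }; tail";
        int start = s.indexOf('{');
        int end = findObjectEnd(s, start);
        System.out.println(end < 0 ? "no match" : s.substring(start, end));
    }
}
```

The extracted substring would still need to go through a real JSON parser; the scanner only finds a plausible boundary.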

Thanks to @Sanjay T. Sharma, who pointed me to "brace matching", which eventually helped me understand greedy expressions, and thanks to the others for telling me up front what I shouldn't do. Fortunately, it turned out to be OK to use the greedy variant of the expression

\\{\\s*title.*\\}

because there is no non-JSON data between the closing brackets.
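
To make the difference concrete, here is a small Java sketch that runs the lazy and the greedy variant against the nested example from the question. The class GreedyVsLazy, the helper firstMatch and the surrounding "var x = ...;" text are assumptions added for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns the first match of the regex in the input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "var x = {title:'Title',{data:'Data'}};";

        // Lazy: stops at the first closing brace, cutting the object short.
        System.out.println(firstMatch("\\{\\s*title.*?\\}", input));

        // Greedy: runs to the last closing brace on the line.
        System.out.println(firstMatch("\\{\\s*title.*\\}", input));
    }
}
```

The greedy form only works under the condition stated above. If another } appears later on the same line, the match extends to that one instead; and since . does not match line terminators by default in Java, an object spanning several lines would need Pattern.DOTALL.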

This is absolutely horrible and I can't believe I'm actually putting my name to this solution, but could you not locate the first { character in a Javascript block and attempt to parse the remaining characters with a proper JSON parsing library? If it works, you've got a match. If it doesn't, keep reading until the next { character and start over.

There are a few issues with that, but they can probably be worked around:

  • you need to be able to identify Javascript blocks. Most languages have HTML-to-DOM libraries (I'm a big fan of Cyberneko for Java) that make it easy to focus on the <script>...</script> blocks.
  • your JSON parsing library needs to stop consuming characters from the stream as soon as it spots an error, and it must not close the stream when it does.

An improvement would be, once you've found the first {, to look for the matching } (a simple counter that is incremented whenever you find a { and decremented whenever you find a } should do the trick). Attempt to parse the resulting string as JSON. Iterate until it works or you've run out of likely blocks.
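
The counter idea might look roughly like this in Java. Everything named here (BraceCounterExtractor, extractFirst, the parses predicate) is hypothetical; the predicate stands in for "try to parse with the JSON library and report whether it succeeded", which in real use would wrap the library call in a try/catch. The counter is deliberately naive and does not account for braces inside string literals or comments.

```java
import java.util.Optional;
import java.util.function.Predicate;

class BraceCounterExtractor {
    /**
     * Tries each '{' in turn: finds its matching '}' with a depth counter,
     * then asks 'parses' whether the candidate is acceptable JSON.
     */
    static Optional<String> extractFirst(String text, Predicate<String> parses) {
        int open = text.indexOf('{');
        while (open >= 0) {
            int depth = 0;
            for (int i = open; i < text.length(); i++) {
                char c = text.charAt(i);
                if (c == '{') {
                    depth++;
                } else if (c == '}' && --depth == 0) {
                    String candidate = text.substring(open, i + 1);
                    if (parses.test(candidate)) {
                        return Optional.of(candidate);
                    }
                    break; // Balanced but rejected; move to the next '{'.
                }
            }
            open = text.indexOf('{', open + 1);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String script =
                "function f() { return 1; } var d = {title:'Title',{data:'Data'}};";
        // Placeholder check; a real one would call the JSON parser.
        Predicate<String> looksLikeTarget = s -> s.startsWith("{title");
        System.out.println(extractFirst(script, looksLikeTarget).orElse("no match"));
    }
}
```

In the sample input the function body is balanced but rejected by the predicate, so the loop moves on to the next { and returns the nested object.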

This is ugly and hackish, and it should never make it into production code. I get the impression that you only need it for a batch job, though, which is why I'm even suggesting it.
