
Regular expression to find all text inside of { } including spaces and newlines

I have spent many hours on this and I am exhausted. I know regex is a very powerful tool, but it is too difficult for me. Please help. I want to extract a JSON string from HTML pages. This is an example of the nested JSON:

<script>

            window.__INITIAL_STATE__ = {
       "properties":"ASSET_HOST", "https"
:"//asom","recaptcha":"ABCD", "aaa": {"b":"C", "D":"E"}
            };

        </script >

I wrote this regular expression to extract all text enclosed in curly braces {}:

pattern = r'(\{.*\s*\});\s*<'

But it returns only part of the string:

{"b":"C", "D":"E"}
            }

Could you advise me how to write a regular expression that extracts the whole string enclosed in {}?

Not sure if this is what you want, but to match the outer curly braces as well you need a recursive pattern. Python's built-in re module does not support recursion; the newer third-party regex module does. Consider:

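import regex as re  # third-party "regex" package (pip install regex); the built-in re has no (?R)

# Match "{", then any mix of non-brace runs or nested brace pairs
# ((?R) re-enters the whole pattern recursively), then the closing "}".
rx = re.compile(r'\{(?:[^{}]*|(?R))*\}')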


junk = """
<script>

            window.__INITIAL_STATE__ = {
       "properties":"ASSET_HOST", "https"
:"//asom","recaptcha":"ABCD", "aaa": {"b":"C", "D":"E"}
            };

        </script >
"""

for match in rx.finditer(junk):
    print(match.group(0))

This yields:

{
       "properties":"ASSET_HOST", "https"
:"//asom","recaptcha":"ABCD", "aaa": {"b":"C", "D":"E"}
            }

See a demo of the expression on regex101.com.
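
As a side note on why the original pattern returns only the inner object: by default `.` does not match a newline, so `\{.*` can only succeed from a `{` whose closing `}` sits on the same line, which here is the nested object. Below is a minimal sketch of a fix with the built-in re module, assuming the page contains a single such assignment that ends in `};` followed by a tag. The names `html`, `pattern` and `state` are mine, for illustration only.

```python
import json
import re

html = """
<script>

            window.__INITIAL_STATE__ = {
       "properties":"ASSET_HOST", "https"
:"//asom","recaptcha":"ABCD", "aaa": {"b":"C", "D":"E"}
            };

        </script >
"""

# re.DOTALL lets "." cross line breaks, so the greedy ".*" runs from the
# first "{" to the last "}" that is followed by ";" and then "<".
pattern = re.compile(r'(\{.*\})\s*;\s*<', re.DOTALL)

m = pattern.search(html)
state = json.loads(m.group(1))  # the captured text is valid JSON in this sample
print(state["aaa"])  # {'b': 'C', 'D': 'E'}
```

Note that the greedy `.*` will overshoot if a later script on the same page also ends in `};` followed by `<`, so this is a quick fix rather than a robust extractor.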


Obligatory warning: "parsing" stuff like this with regular expressions is usually not the way to go.
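
To illustrate that warning: if the embedded object is valid JSON, as it is in the sample, the standard library's `json.JSONDecoder.raw_decode` can parse it straight out of the page. It stops at the object's own closing brace and ignores whatever follows, and nested objects or braces inside strings are handled by the parser. A rough sketch, assuming the object comes right after the text `window.__INITIAL_STATE__`; the helper name `extract_initial_state` and the `MARKER` constant are hypothetical:

```python
import json

MARKER = "window.__INITIAL_STATE__"  # assumed to precede the object on the page


def extract_initial_state(html: str) -> dict:
    """Parse the first JSON object that follows MARKER in html."""
    start = html.index(MARKER)        # raises ValueError if the marker is missing
    brace = html.index("{", start)    # first "{" after the marker
    # raw_decode parses one JSON value starting at index `brace` and returns
    # (value, end_index); trailing text such as "};</script>" is left alone.
    obj, _end = json.JSONDecoder().raw_decode(html, brace)
    return obj


sample = """
<script>
            window.__INITIAL_STATE__ = {
       "properties":"ASSET_HOST", "https"
:"//asom","recaptcha":"ABCD", "aaa": {"b":"C", "D":"E"}
            };
        </script >
"""

state = extract_initial_state(sample)
print(state["recaptcha"])  # ABCD
```

This only works when the script really contains JSON; a JavaScript object literal with unquoted keys or trailing commas would make `raw_decode` raise `json.JSONDecodeError`.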

