Regex to remove all except XML

I need help with a regex for Notepad++ that matches everything except the XML.

The regex I'm using: (!?\<.*\>) <-- I want the opposite of this (it only matches correctly on the first three lines)

The example code:

[20173003] This text is what I want to delete [<Person><Name>Foo</Name><Surname>Bar</Surname></Person>], and this text too.
[20173003] This is another text to delete [<Person><Name>Bar</Name><Surname>Foo</Surname></Person>]
[20173003] This text too... [<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>], delete me!
[20173003] But things like this make the regex to fail < [<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>], or this>

Expected result:

<Person><Name>Foo</Name><Surname>Bar</Surname></Person>
<Person><Name>Bar</Name><Surname>Foo</Surname></Person>
<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>
<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>

Thanks in advance!

This is not perfect, but it should work with your input, which looks quite simple and well-structured.

If you only need to handle a single, unnested <Person> tag, you may use the simple regex (<Person>.*?</Person>)|. (it matches and captures any <Person> element into Group 1, and otherwise matches any other single char) and replace with the conditional replacement pattern (?{1}$1\n:) (it reinserts the Person element followed by a newline, or replaces the match with an empty string):
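Outside Notepad++, the same "capture what you keep, match everything else" idea can be sketched in Python. This is only an illustration, not part of the Notepad++ solution: Python's `re` module has no `(?{1}...)` conditional replacement syntax, so a replacement callback plays that role here. The names `keep_person_tags`, `PATTERN`, and `SAMPLE` are made up for the example.

```python
import re

# Two lines of the sample input from the question.
SAMPLE = (
    "[20173003] This text is what I want to delete "
    "[<Person><Name>Foo</Name><Surname>Bar</Surname></Person>], and this text too.\n"
    "[20173003] But things like this make the regex to fail < "
    "[<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>], or this>\n"
)

# Group 1 captures a <Person> element; the bare "." alternative consumes
# every other character (including newlines, because of re.DOTALL).
PATTERN = re.compile(r"(<Person>.*?</Person>)|.", re.DOTALL)


def keep_person_tags(text):
    """Keep only <Person>...</Person> elements, one per line."""
    def replace(match):
        # Stand-in for the conditional replacement: if Group 1 took part
        # in the match, emit it plus a newline; otherwise emit nothing.
        return match.group(1) + "\n" if match.group(1) else ""
    return PATTERN.sub(replace, text)


print(keep_person_tags(SAMPLE), end="")
```

The stray `<` and `>` on the second sample line are harmless here because they are consumed one character at a time by the `.` alternative.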

To make it a bit more generic, you may capture the opening and corresponding closing XML tags with a recursion-based Boost regex, and use the appropriate conditional replacement pattern:

Find What: (<(\w+)[^>]*>(?:(?!</?\2\b).|(?1))*</\2>)|.
Replace With: (?{1}$1\n:)
. matches newline: ON

Regex details:

  • (<(\w+)[^>]*>(?:(?!</?\2\b).|(?1))*</\2>) - Capturing group 1 (later recursed with the (?1) subroutine call), matching:
    • <(\w+)[^>]*> - any opening tag, with its name captured into Group 2
    • (?:(?!</?\2\b).|(?1))* - zero or more occurrences of:
      • (?!</?\2\b). - any char (.) that does not start a sequence of <, an optional /, and the Group 2 tag name as a whole word
      • | - or
      • (?1) - the whole Group 1 subpattern, recursed (repeated)
    • </\2> - the corresponding closing tag
  • | - or
  • . - any single char.
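For readers who want the same nesting-aware behaviour outside Notepad++, here is a rough Python sketch. It is a swapped-in technique, not a port of the regex: Python's built-in `re` module does not support recursion such as `(?1)`, so the sketch counts the depth of same-named tags instead. The names `extract_elements` and `OPEN_TAG` are hypothetical, and the sketch assumes simple input like the sample (no comments, CDATA, or `>` inside attribute values).

```python
import re

# Any opening tag, with its name captured (mirrors <(\w+)[^>]*> above).
OPEN_TAG = re.compile(r"<(\w+)[^>]*>")


def extract_elements(text):
    """Return the top-level balanced <tag>...</tag> elements found in text."""
    found = []
    pos = 0
    while True:
        start = OPEN_TAG.search(text, pos)
        if start is None:
            break
        # Matches an opening or closing tag with the same name only.
        same_tag = re.compile(r"<(/?)" + re.escape(start.group(1)) + r"\b[^>]*>")
        depth = 1
        scan = start.end()
        while depth:
            tag = same_tag.search(text, scan)
            if tag is None:
                break  # ran out of text before the element was closed
            depth += -1 if tag.group(1) else 1
            scan = tag.end()
        if depth == 0:
            found.append(text[start.start():scan])
            pos = scan
        else:
            pos = start.start() + 1  # unbalanced: skip this "<" and go on
    return found


line = "[20173003] fail < [<Person><Name>Lorem</Name></Person>], or this>"
print("\n".join(extract_elements(line)))
```

Counting depth keeps a nested element of the same name inside its parent, which is the job the `(?1)` recursion does in the Boost pattern.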

Replacement pattern:

  • (?{1} - if Group 1 matched:
    • $1\n - replace with its contents plus a newline
    • : - else replace with an empty string
  • ) - end of the conditional replacement.