简体   繁体   中英

Find all occurrences of regex pattern, but ignore occurrences that contain another pattern

I have a block of text that I'm trying to parse:

「<%sM_item2><%sM_plusnum2>の| <%sM_slot>の部分を| <%sM_change_color>に カラーリングするのですね?|<br>|「それでは <%sM_item>が 10本と| <%nM_gold>ゴールドが必要ですが よろしいですか?|<yesno><close>

In this block of text, I'm trying to regex split on all occurrences of <???> , EXCEPT for when it matches on <%???> .

I have it mostly working with this:

re.split(r'<((?!%).+?)>', source_text)

['「<%sM_item2><%sM_plusnum2>の|\u3000<%sM_slot>の部分を|\u3000<%sM_change_color>に\u3000カラーリングするのですね?|', 'br', '|「それでは\u3000<%sM_item>が\u300010
本と|\u3000<%nM_gold>ゴールドが必要ですが\u3000よろしいですか?|', 'yesno', '', 'close', '']

My problem is although it kept the <%???> tags in place, it somehow stripped the <> characters from the matches (notice 'yesno', 'close', and 'br' tags no longer have those characters).

Based on the documentation of re.split :

Split string by the occurrences of pattern. If capturing parentheses are used 
in pattern, then the text of all groups in the pattern are also returned as 
part of the resulting list.

In this case, my parentheses needs to be placed on the outside of the match to preserve the () .

re.split('(<(?!%).+?>)', source_text)
['「<%sM_item2><%sM_plusnum2>の|\u3000<%sM_slot>の部分を|\u3000<%sM_change_color>に\u3000カラーリングするのですね?|', '<br>', '|「それでは\u3000<%sM_item>が\u300010本と|\u3000<%nM_gold>ゴールドが必要ですが\u3000よろしいですか?|', '<yesno>', '', '<close>', '']
 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM