
Python regular expression; brackets within brackets

I know there are SO MANY Python regular expression questions here, but I just cannot figure out my specific problem, even with examples.

I have tried using regex101 but it's just not clicking.

I have these sentences:

[Hi]-THISISALOADOFTEXT-[text]
I-X-(blah[THIS2CAN2Have-SymbolsAndNumbers0])-ABCD-{x}A-AB
A-[This can 4 have any X1 rubbish in it]-ABCDDS-OH
A-F{a}R-(textnumber1)-AB-[ThisIsText123]-P-{d}C-(ThisCanHaveNumbers1)-W-[ThisIsSymbolsText123]

I just want to pull out what is between the square brackets, EXCEPT when the square brackets are enclosed by parentheses (rounded brackets).

So in the above example, it would return:

[Hi], [text]
...nothing returned for line 2...
[This can 4 have any X1 rubbish in it]
[ThisIsText123], [ThisIsSymbolsText123]

It almost works with this code:

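import re
# one of the sentences above, used as sample input
text = 'I-X-(blah[THIS2CAN2Have-SymbolsAndNumbers0])-ABCD-{x}A-AB'
pattern = re.compile(r'(\[.*?\])')
regex = re.findall(pattern, text)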

I tried to add a 'not' like this: ?!A-Za-z0-9(\[.*?\]), which I got from the Python manual, but none of my attempts worked.

The only problem is that the code above also returns [THIS2CAN2Have-SymbolsAndNumbers0], which I do not want because it is enclosed by parentheses.

Importantly, and where I am getting stuck, is that there can be text and numbers in between the square brackets and the rounded brackets, as in this example: (blah[THIS2CAN2Have-SymbolsAndNumbers0])

Can someone help?

As a side note, just FYI, the ultimate goal once I figure out the regex is to incorporate into a loop that says:

  1. For each sentence, find text in square brackets
  2. If square brackets not enclosed by parentheses (rounded brackets), do one routine.
  3. elif square brackets enclosed by parentheses, do a different routine.

Edit 1: How could I extend this so that, for sequences that have square brackets inside parentheses, the full parenthesized phrase is returned? For example, the input sequences:

[Hi]-THISISALOADOFTEXT-[text]
I-X-(blah[THIS2CAN2Have-SymbolsAndNumbers0])-ABCD-{x}A-AB
A-[This can 4 have any X1 rubbish in it]-ABCDDS-OH
A-F{a}R-(textnumber1)-AB-[ThisIsText123]-P-{d}C-(ThisCanHaveNumbers1)-W-[ThisIsSymbolsText123]

Would produce the output:

[Hi], [text]
(blah[THIS2CAN2Have-SymbolsAndNumbers0])
[This can 4 have any X1 rubbish in it]
[ThisIsText123], [ThisIsSymbolsText123]

in a way that lets me run one subroutine on the rounded-bracket output '(blah[THIS2CAN2Have-SymbolsAndNumbers0])' and a different one on the other outputs, which are not in rounded brackets.

You may use the two following patterns:

  • Not enclosed in parentheses: \[[^]]+\](?!\))
  • Enclosed in parentheses: \[[^]]+\](?=\))

As per your new requirement, you may use:

  • Enclosed in parentheses, with the parentheses included in the match: \([^[]+\[[^]]+\]\)

My answer assumes the brackets are balanced and that the closing ) immediately follows the ].
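To make that assumption concrete, here is a small sketch of how the lookahead pattern behaves when text sits between the ] and the ). The two input strings are made up for illustration and are not from the question.

```python
import re

# Pattern from the answer: a [...] group NOT directly followed by ')'
bare = re.compile(r'\[[^]]+\](?!\))')

# Hypothetical inputs, chosen only to illustrate the assumption
tight = bare.findall('(blah[x])')      # ']' is directly followed by ')'
loose = bare.findall('(blah[x]more)')  # text sits between ']' and ')'

print(tight)  # []
print(loose)  # ['[x]']
```

In the second case the bracket is reported as "not enclosed" even though it sits inside parentheses, so the pattern is only reliable when the data really has the ]) shape shown in the question.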

In Python:

import re
mytext='''
[Hi]-THISISALOADOFTEXT-[text]
I-X-(blah[THIS2CAN2Have-SymbolsAndNumbers0])-ABCD-{x}A-AB
A-[This can 4 have any X1 rubbish in it]-ABCDDS-OH
A-F{a}R-(textnumber1)-AB-[ThisIsText123]-P-{d}C-(ThisCanHaveNumbers1)-W-[ThisIsSymbolsText123]
'''

print('no ():')
for i in re.findall(r'\[[^]]+\](?!\))',mytext):
    print(i)
    #do one routine

print('with ():')
for i in re.findall(r'\([^[]+\[[^]]+\]\)',mytext):
    print(i)
    #do second routine

Prints:

no ():
[Hi]
[text]
[This can 4 have any X1 rubbish in it]
[ThisIsText123]
[ThisIsSymbolsText123]
with ():
(blah[THIS2CAN2Have-SymbolsAndNumbers0])
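If you want to process each sentence in one pass and branch between the two routines, one possible sketch is a single alternation that tries the parenthesized form first and falls back to a bare [...] group. The names TOKEN and classify are hypothetical helpers, not part of the patterns above, and this variant also accepts text between the ] and the ). It still assumes balanced, non-nested parentheses.

```python
import re

# Hypothetical combined pattern: the parenthesised alternative is tried first,
# so a [...] that sits inside (...) is consumed as part of the larger match.
TOKEN = re.compile(r'''
    (?P<paren>\([^()]*\[[^\]]*\][^()]*\))   # (...[...]...) with no nested parentheses
  | (?P<bare>\[[^\]]+\])                    # [...] outside parentheses
''', re.VERBOSE)

def classify(line):
    """Return (bare, enclosed): matches outside and inside parentheses."""
    bare, enclosed = [], []
    for m in TOKEN.finditer(line):
        if m.lastgroup == 'paren':
            enclosed.append(m.group())   # second routine would go here
        else:
            bare.append(m.group())       # first routine would go here
    return bare, enclosed

sentences = [
    '[Hi]-THISISALOADOFTEXT-[text]',
    'I-X-(blah[THIS2CAN2Have-SymbolsAndNumbers0])-ABCD-{x}A-AB',
]
for s in sentences:
    print(classify(s))
```

Because matching proceeds left to right and each match consumes its text, a parenthesized group such as (textnumber1) that contains no square brackets is simply skipped, and later [...] groups on the same line are still found.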
