
How do I split a string based on multiple delimiters including text within parentheses in Python3?

I want to split a string based on multiple delimiters:

  • ,
  • .
  • /
  • \
  • |
  • +
  • &
  • ;
  • AND (case insensitive)

However, I also want to extract the text inside brackets of different types: (), {}, and [].

This is an example string that I want to convert:

"Hello (Bob), Tree+Leaf. {text} AND Bye"

I want it to be split into a list like this:
["Hello", "Bob", "Tree", "Leaf", "text", "Bye"]

I understand how to split on the simple delimiters by using re.split(r',|\.|/|\\|\||\+|&|;|AND', input_string), but I am not sure how to also extract the text inside the parentheses in the same pass as the other delimiter splits.

I would also like all the substrings to be trimmed. For example, splitting the string "Hello, World" should give ["Hello", "World"] and not ["Hello", " World"].

Use

[t for t in re.split(r'\s*(?:\bAND\b|[,./\\|+&;]|\(([^()]*)\)|\[([^][]*)]|{([^{}]*)})\s*', input_string) if t]

EXPLANATION

--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
    AND                      'AND'
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    [,./\\|+&;]              any character of: ',', '.', '/', '\\',
                             '|', '+', '&', ';'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \(                       '('
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      [^()]*                   any character except: '(', ')' (0 or
                               more times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
    \)                       ')'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \[                       '['
--------------------------------------------------------------------------------
    (                        group and capture to \2:
--------------------------------------------------------------------------------
      [^][]*                   any character except: ']', '[' (0 or
                               more times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \2
--------------------------------------------------------------------------------
    ]                        ']'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    {                        '{'
--------------------------------------------------------------------------------
    (                        group and capture to \3:
--------------------------------------------------------------------------------
      [^{}]*                   any character except: '{', '}' (0 or
                               more times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \3
--------------------------------------------------------------------------------
    }                        '}'
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
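
To see why the list comprehension filters on `if t`, it may help to look at what `re.split` returns before filtering. Because the pattern contains three capturing groups, every match adds three extra items to the result: the captured bracket text, or `None` for each group that did not take part in the match. Adjacent delimiters also leave empty strings between them. A small sketch (the names `pattern`, `raw` and `tokens` are only illustrative):

```python
import re

pattern = r'\s*(?:\bAND\b|[,./\\|+&;]|\(([^()]*)\)|\[([^][]*)]|{([^{}]*)})\s*'
input_string = "Hello (Bob), Tree+Leaf. {text} AND Bye"

# Unfiltered result: the pieces between matches, interleaved with the three
# capture groups of every match (None when a group did not take part).
raw = re.split(pattern, input_string)
print(raw)

# Dropping falsy items removes both the None placeholders and the
# empty strings left between adjacent delimiters.
tokens = [t for t in raw if t]
print(tokens)
```

The first print shows the `None` and `''` entries mixed in with the wanted substrings; the second shows only the substrings.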

See Python proof:

import re
input_string = "Hello (Bob), Tree+Leaf. {text} AND Bye"
print( [t for t in re.split(r'\s*(?:\bAND\b|[,./\\|+&;]|\(([^()]*)\)|\[([^][]*)]|{([^{}]*)})\s*', input_string) if t] )

Results: ['Hello', 'Bob', 'Tree', 'Leaf', 'text', 'Bye']
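
The question asks for AND to be matched case-insensitively, which the pattern does not do as written. One possible way to cover that, as a sketch, is to compile the same pattern with `re.IGNORECASE`. The `\b` word boundaries should still keep words such as "Sandy" intact. The names `PATTERN` and `split_tokens` are hypothetical and not part of the original answer.

```python
import re

# Same pattern as in the answer, compiled with IGNORECASE so that
# "and", "And" and "AND" are all treated as delimiters.
PATTERN = re.compile(
    r'\s*(?:\bAND\b|[,./\\|+&;]|\(([^()]*)\)|\[([^][]*)]|{([^{}]*)})\s*',
    flags=re.IGNORECASE,
)

def split_tokens(text):
    """Split on the delimiters and bracket pairs, dropping empty pieces."""
    return [t for t in PATTERN.split(text) if t]

print(split_tokens("Hello and Bob And Tree"))
print(split_tokens("Sandy and Band"))
```

Compiling once with `re.compile` is a convenience when many strings are split; passing `flags=re.IGNORECASE` directly to `re.split` would work as well.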

 