
Using Python regex to change variations of a string?

I have a list of strings that follow this general pattern:

X (a, b, c, d)

where:

X is some variation of the string "item description"

a, b, c, d is some variation of comma-separated words, symbols, and numbers.

I'm trying to remove the parentheses and the text outside them so that it becomes this:

a, b, c, d

I have noticed some horrendous variations on the input:

# ideal input
items (lcd, cardboard, hats on rack, keyboard cat)

# Sometimes missing/extra space (both outside text and inside)
items( lcd , cardboard,hats on rack  , keyboard cat)

# Outside text may contain other symbols and words
items & descrips: (lcd, cardboard, hats on rack, keyboard cat)

# Inner text may contain parentheses, brackets, other enclosures
descriptions & items: (lcd (for computer), cardboard  {brown & white colored}, hats on rack, keyboard cat[dept. 11])

# Parent parenthesis may not be closed
items: (lcd, cardboard, hats on rack, keyboard cat (dept. 11)

# Using semi-colons instead of commas
item (lcd; cardboard; hats on rack; keyboard cat)

# Some text has non-ASCII characters
item (lcd\u2122, cardboard)

The ideal output would be

lcd, cardboard, hats on rack, keyboard cat

Some clarifications:

(1) Any inner enclosures (and their contents) should be removed

i.e.:

descriptions & items: (lcd (for computer), cardboard  {brown & white colored}, hats on rack, keyboard cat[dept. 11])

should be:

lcd, cardboard, hats on rack, keyboard cat

What's an appropriate regex for this? The different variations make this very difficult with my limited regex skills.

Sample input array:

a = [
"items (lcd, cardboard, hats on rack, keyboard cat)",
"items( lcd , cardboard,hats on rack  , keyboard cat)",
"items & descrips: (lcd, cardboard, hats on rack, keyboard cat)",
"descriptions & items: (lcd (for computer), cardboard  {brown & white colored}, hats on rack, keyboard cat[dept. 11])",
"items: (lcd, cardboard, hats on rack, keyboard cat (dept. 11)",
"items: (lcd, cardboard, hats on rack, keyboard cat [dept. 11]",
"item (lcd; cardboard; hats on rack; keyboard cat)", 
u"item (lcd\u2122, cardboard)"
]

Hmm... I'm not sure if this is what you want, but it works fine if a is a list like the one in your example:

import re

a = [
"items (lcd, cardboard, hats on rack, keyboard cat)",
"items( lcd , cardboard,hats on rack  , keyboard cat)",
"items & descrips: (lcd, cardboard, hats on rack, keyboard cat)",
"descriptions & items: (lcd (for computer), cardboard  {brown & white colored}, hats on rack, keyboard cat[dept. 11])",
"items: (lcd, cardboard, hats on rack, keyboard cat (dept. 11)",
"items: (lcd, cardboard, hats on rack, keyboard cat [dept. 11]",
"item (lcd; cardboard; hats on rack; keyboard cat)", 
u"item (lcd\u2122, cardboard)"
]

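for s in a:
    # Take everything after the first "(" (the parent parenthesis may be unclosed).
    text = re.search(r'\((.*)', s).group(1)
    # Drop inner (...), [...] and {...} enclosures together with their contents.
    text = re.sub(r'\(.+?\)|\[.+?\]|\{.+?\}', '', text)
    # Normalize "," and ";" separators to ", " and trim both ends.
    text = re.sub(r' *[,;] *', ', ', text).strip()
    # Remove the closing ")" of the parent parenthesis, if there is one.
    if text.endswith(')'):
        text = text[:-1]
    # Skip anything that still contains an enclosure character.
    if not re.search(r'[(\[{}\])]', text):
        print(text)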

Output:

lcd, cardboard, hats on rack, keyboard cat
lcd, cardboard, hats on rack, keyboard cat
lcd, cardboard, hats on rack, keyboard cat
lcd, cardboard, hats on rack, keyboard cat
lcd, cardboard, hats on rack, keyboard cat
lcd, cardboard, hats on rack, keyboard cat
lcd, cardboard, hats on rack, keyboard cat
lcd™, cardboard
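To see where that output comes from, here is a small sketch that runs the same regex calls one at a time on the most cluttered sample string and prints each intermediate value. The variable names (after_open, no_inner, normalized, cleaned) are made up for illustration and are not part of the code above.

```python
import re

s = ("descriptions & items: (lcd (for computer), cardboard  "
     "{brown & white colored}, hats on rack, keyboard cat[dept. 11])")

# Everything after the first "(" in the string.
after_open = re.search(r'\((.*)', s).group(1)

# Inner (...), [...] and {...} enclosures removed with their contents.
no_inner = re.sub(r'\(.+?\)|\[.+?\]|\{.+?\}', '', after_open)

# Separators normalized to ", " and outer whitespace trimmed.
normalized = re.sub(r' *[,;] *', ', ', no_inner).strip()

# Closing ")" of the parent parenthesis removed, if present.
cleaned = normalized[:-1] if normalized.endswith(')') else normalized

for value in (after_open, no_inner, normalized, cleaned):
    print(repr(value))
```

Printing with repr() makes the leftover spaces visible after the enclosures are removed, which is why the separator normalization runs afterwards.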

Here is what the code does:

  1. Match (<text> in the string. The closing parenthesis is not required because, as you said, the parent parenthesis may not be closed.

  2. Use re.sub() to remove every (<string>), [<string>] and {<string>} inside the <text>.

  3. Make the format readable: use  *[,;] * to match each , or ; together with any spaces around it, then replace the match with a comma followed by a single space.

  4. Remove the ) at the end of the line, if there is one.

  5. If the <text> still contains any enclosure characters, like the example I asked about in the comments (it seems to be gone from your new list, but I kept the check), skip that string.

  6. Print the result (you can also collect the results in a list if you like).
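If you would rather collect the results in a list than print them, the steps above can be wrapped in a function. This is only a sketch: the name extract_items is hypothetical, and skipping strings that have no opening parenthesis is my assumption, since the original loop would raise an error on such input.

```python
import re

def extract_items(strings):
    """Return the cleaned item text for each string that can be parsed."""
    results = []
    for s in strings:
        match = re.search(r'\((.*)', s)
        if match is None:
            continue  # assumption: silently skip strings without "("
        # Remove inner enclosures, then normalize the separators.
        text = re.sub(r'\(.+?\)|\[.+?\]|\{.+?\}', '', match.group(1))
        text = re.sub(r' *[,;] *', ', ', text).strip()
        # Drop the parent's closing ")" if it is there.
        if text.endswith(')'):
            text = text[:-1]
        # Keep only results with no leftover enclosure characters.
        if not re.search(r'[(\[{}\])]', text):
            results.append(text)
    return results

samples = [
    "items( lcd , cardboard,hats on rack  , keyboard cat)",
    "items: (lcd, cardboard, hats on rack, keyboard cat (dept. 11)",
    "item (lcd; cardboard; hats on rack; keyboard cat)",
    "no parentheses here",
]
print(extract_items(samples))
```

Note that the non-greedy patterns only handle one level of inner enclosure. Something like (a (b) c) nested inside the parent would leave stray characters behind, and the final check would then drop that string.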
