简体   繁体   中英

extract parts of the string in python

I have to parse an input string in python and extract certain parts from it.

the format of the string is

(xx,yyy,(aa,bb,...)) // Inner parenthesis can hold one or more characters in it

I want a function to return xx, yyyy and a list containing aa, bb ... etc

I can ofcourse do it by trying to split of the parenthesis and stuff but I want to know if there a proper pythonic way of extracting such info from a string

I have this code which works, but is there a better way to do it (without regex)

def processInput(inputStr):
    value = inputStr.strip()[1:-1]
    parts = value.split(',', 2)
    return parts[0], parts[1], (parts[2].strip()[1:-1]).split(',')

If your parenthesis nesting can be arbitrarily deep, then regexen won't do, you'll need a state machine or a parser. Pyparsing supports recursive grammars using forward-declaration class Forward:

from pyparsing import *

LPAR,RPAR,COMMA = map(Suppress,"(),")
nestedParens = Forward()
listword = Word(alphas) | '...'
nestedParens << Group(LPAR + delimitedList(listword | nestedParens) + RPAR)

text = "(xx,yyy,(aa,bb,...))"
results = nestedParens.parseString(text).asList()
print results

text = "(xx,yyy,(aa,bb,(dd,ee),ff,...))"
results = nestedParens.parseString(text).asList()
print results

Prints:

[['xx', 'yyy', ['aa', 'bb', '...']]]
[['xx', 'yyy', ['aa', 'bb', ['dd', 'ee'], 'ff', '...']]]

If you're allergic to REs, you could use pyparsing :

>>> import pyparsing as p
>>> ope, clo, com = map(p.Suppress, '(),')
>>> w = p.Word(p.alphas)
>>> s = ope + w + com + w + com + ope + p.delimitedList(w) + clo + clo
>>> x = '(xx,yyy,(aa,bb,cc))'
>>> list(s.parseString(x))
['xx', 'yyy', 'aa', 'bb', 'cc']

pyparsing also makes it easy to control the exact form of results (eg by grouping the last 3 items into their own sublist), if you want. But I think the nicest aspect is how natural (depending on how much space you want to devote to it) you can make the "grammar specification" read: an open paren, a word, a comma, a word, a comma, an open paren, a delimited list of words, two closed parentheses (if you find the assignment to s above not so easy to read, I guess it's my fault for not choosing longer identifiers;-).

Let's use regular expressions!

/\(([^,]+),([^,]+),\(([^)]+)\)\)/

Match against that, first capturing group contains xx, second contains yyy, split the third on , and you have your list.

How about like this?

>>> import ast
>>> import re
>>>
>>> s="(xx,yyy,(aa,bb,ccc))"
>>> x=re.sub("(\w+)",'"\\1"',s)
# '("xx","yyy",("aa","bb","ccc"))'
>>> ast.literal_eval(x)
('xx', 'yyy', ('aa', 'bb', 'ccc'))
>>>

I don't know that this is better, but it's a different way to do it. Using the regex previously suggested

 def processInput(inputStr):
        value = [re.sub('\(*\)*','',i) for i in inputStr.split(',')]
        return value[0], value[1], value[2:]

Alternatively, you could use two chained replace functions in lieu of the regex.

Your solution is decent (simple, efficient). You could use regular expressions to restrict the syntax if you don't trust your data source.

import re
parser_re = re.compile(r'\(([^,)]+),([^,)]+),\(([^)]+)\)')
def parse(input):
    m = parser_re.match(input)
    if m:
        first = m.group(1)
        second = m.group(2)
        rest = m.group(3).split(",")
        return (first, second, rest)
    else:
        return None

print parse( '(xx,yy,(aa,bb,cc,dd))' )
print parse( 'xx,yy,(aa,bb,cc,dd)' ) # doesn't parse, returns None

# can use this to unpack the various parts.
# first,second,rest = parse(...)

Prints:

('xx', 'yy', ['aa', 'bb', 'cc', 'dd'])
None

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM