Extract parts of a string in Python

Question

I have to parse an input string in Python and extract certain parts from it.

The format of the string is:

(xx,yyy,(aa,bb,...)) // The inner parentheses can hold one or more items

I want a function that returns xx, yyy, and a list containing aa, bb, ... etc.

I can of course do it by splitting on the parentheses and so on, but I want to know if there is a proper Pythonic way of extracting such info from a string.

I have this code, which works, but is there a better way to do it (without regex)?

def processInput(inputStr):
    value = inputStr.strip()[1:-1]
    parts = value.split(',', 2)
    return parts[0], parts[1], (parts[2].strip()[1:-1]).split(',')

Answer 1

If your parenthesis nesting can be arbitrarily deep, then regexen won't do; you'll need a state machine or a parser. Pyparsing supports recursive grammars using the forward-declaration class Forward:

from pyparsing import *

LPAR,RPAR,COMMA = map(Suppress,"(),")
nestedParens = Forward()
listword = Word(alphas) | '...'
nestedParens << Group(LPAR + delimitedList(listword | nestedParens) + RPAR)

text = "(xx,yyy,(aa,bb,...))"
results = nestedParens.parseString(text).asList()
print(results)

text = "(xx,yyy,(aa,bb,(dd,ee),ff,...))"
results = nestedParens.parseString(text).asList()
print(results)

Prints:

[['xx', 'yyy', ['aa', 'bb', '...']]]
[['xx', 'yyy', ['aa', 'bb', ['dd', 'ee'], 'ff', '...']]]
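
If pulling in pyparsing isn't an option, the "state machine or a parser" route mentioned above can be sketched by hand in plain Python. This is only a minimal recursive-descent sketch, not anything pyparsing provides: the name parse_nested is made up for illustration, and it assumes items contain no commas or parentheses and that there is no whitespace around nested groups.

```python
def parse_nested(text):
    """Parse '(a,b,(c,d))' into nested lists: ['a', 'b', ['c', 'd']]."""
    text = text.strip()
    pos = 0  # current read position, shared with the inner function

    def parse_group():
        nonlocal pos
        if pos >= len(text) or text[pos] != "(":
            raise ValueError(f"expected '(' at position {pos}")
        pos += 1  # consume '('
        items = []
        while True:
            if pos < len(text) and text[pos] == "(":
                # A nested group: recurse and store its list as one item.
                items.append(parse_group())
            else:
                # A plain item: read up to the next delimiter.
                start = pos
                while pos < len(text) and text[pos] not in "(),":
                    pos += 1
                token = text[start:pos].strip()
                if not token:
                    raise ValueError(f"empty item at position {start}")
                items.append(token)
            if pos >= len(text):
                raise ValueError("unbalanced parentheses")
            if text[pos] == ",":
                pos += 1  # more items follow
            elif text[pos] == ")":
                pos += 1  # this group is complete
                return items
            else:
                raise ValueError(f"unexpected {text[pos]!r} at position {pos}")

    result = parse_group()
    if text[pos:].strip():
        raise ValueError("trailing characters after closing parenthesis")
    return result


print(parse_nested("(xx,yyy,(aa,bb,(dd,ee),ff,...))"))
```

Unlike the pyparsing version, this returns the outer list directly rather than wrapped in another list, and it raises ValueError on malformed input.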

Answer 2

If you're allergic to REs, you could use pyparsing:

>>> import pyparsing as p
>>> ope, clo, com = map(p.Suppress, '(),')
>>> w = p.Word(p.alphas)
>>> s = ope + w + com + w + com + ope + p.delimitedList(w) + clo + clo
>>> x = '(xx,yyy,(aa,bb,cc))'
>>> list(s.parseString(x))
['xx', 'yyy', 'aa', 'bb', 'cc']

pyparsing also makes it easy to control the exact form of the results (e.g. by grouping the last three items into their own sublist), if you want. But I think the nicest aspect is how naturally the "grammar specification" can read (depending on how much space you want to devote to it): an open paren, a word, a comma, a word, a comma, an open paren, a delimited list of words, two closed parens. If you find the assignment to s above not so easy to read, I guess it's my fault for not choosing longer identifiers ;-).

Answer 3

Let's use regular expressions!

/\(([^,]+),([^,]+),\(([^)]+)\)\)/

Match against that: the first capturing group contains xx, the second contains yyy, and splitting the third on , gives you your list.
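
In Python, that could look roughly like the following. It's a sketch using the standard re module with the same pattern written as a raw string; the names PATTERN and extract are made up for illustration, and it assumes the input has exactly the one-level shape from the question.

```python
import re

# Same pattern as above, without the surrounding slashes.
PATTERN = re.compile(r"\(([^,]+),([^,]+),\(([^)]+)\)\)")


def extract(text):
    """Return (xx, yyy, [aa, bb, ...]) or None if the text doesn't match."""
    m = PATTERN.match(text.strip())
    if m is None:
        return None
    first, second, inner = m.groups()
    # The third group is the raw "aa,bb,..." text; split it into a list.
    return first, second, inner.split(",")


print(extract("(xx,yyy,(aa,bb,cc))"))
```

Returning None on a failed match lets the caller decide how to handle bad input.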

Answer 4

How about like this?

>>> import ast
>>> import re
>>>
>>> s="(xx,yyy,(aa,bb,ccc))"
>>> x=re.sub(r"(\w+)",r'"\1"',s)
# '("xx","yyy",("aa","bb","ccc"))'
>>> ast.literal_eval(x)
('xx', 'yyy', ('aa', 'bb', 'ccc'))
>>>

Answer 5

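I don't know that this is better, but it's a different way to do it. It splits on commas and uses a regex to strip the parentheses from each piece:

import re

def processInput(inputStr):
    # Split on commas, then drop any parentheses left on each piece
    value = [re.sub(r'[()]', '', i) for i in inputStr.split(',')]
    return value[0], value[1], value[2:]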

Alternatively, you could use two chained replace functions in lieu of the regex.
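
For what it's worth, the chained-replace variant might look like this. It's a minimal sketch: the name process_input is just illustrative, and it assumes the fixed one-level format from the question, since any deeper nesting would simply be flattened.

```python
def process_input(input_str):
    # Remove every parenthesis with two chained str.replace calls,
    # then split what is left on commas.
    parts = input_str.replace("(", "").replace(")", "").split(",")
    return parts[0], parts[1], parts[2:]


print(process_input("(xx,yyy,(aa,bb,cc))"))
```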

Answer 6

Your solution is decent (simple, efficient). You could use regular expressions to restrict the syntax if you don't trust your data source.

import re
parser_re = re.compile(r'\(([^,)]+),([^,)]+),\(([^)]+)\)')
def parse(input):
    m = parser_re.match(input)
    if m:
        first = m.group(1)
        second = m.group(2)
        rest = m.group(3).split(",")
        return (first, second, rest)
    else:
        return None

print(parse('(xx,yy,(aa,bb,cc,dd))'))
print(parse('xx,yy,(aa,bb,cc,dd)'))  # doesn't parse, returns None

# can use this to unpack the various parts.
# first,second,rest = parse(...)

Prints:

('xx', 'yy', ['aa', 'bb', 'cc', 'dd'])
None

6 answers

solution1
3 2010-07-01 03:34:41

solution2
3 ACCEPTED 2010-07-01 04:17:35

solution3
2 2010-07-01 02:32:56

solution4
2 2010-07-01 02:47:50

solution5
1 2010-07-01 02:36:57

solution6
0 2010-07-01 05:12:16
