Extracting parts of a string in Python

Question

I have to parse an input string in Python and extract certain parts from it.

The format of the string is:

(xx,yyy,(aa,bb,...)) // the inner parentheses can hold one or more items

I want a function to return xx, yyy and a list containing aa, bb, ... etc.

I can of course do it by splitting on the parentheses and so on, but I want to know if there is a proper Pythonic way of extracting such info from a string.

I have this code, which works, but is there a better way to do it (without regex)?

def processInput(inputStr):
    value = inputStr.strip()[1:-1]
    parts = value.split(',', 2)
    return parts[0], parts[1], (parts[2].strip()[1:-1]).split(',')

Answer 1

If your parenthesis nesting can be arbitrarily deep, then regular expressions won't do; you'll need a state machine or a parser. Pyparsing supports recursive grammars through its forward-declaration class, Forward:

from pyparsing import Forward, Group, Suppress, Word, alphas, delimitedList

# Punctuation is matched but suppressed from the results.
LPAR, RPAR, COMMA = map(Suppress, "(),")
nestedParens = Forward()  # forward declaration so the grammar can refer to itself
listword = Word(alphas) | '...'
nestedParens << Group(LPAR + delimitedList(listword | nestedParens) + RPAR)

text = "(xx,yyy,(aa,bb,...))"
results = nestedParens.parseString(text).asList()
print(results)

text = "(xx,yyy,(aa,bb,(dd,ee),ff,...))"
results = nestedParens.parseString(text).asList()
print(results)

Prints:

[['xx', 'yyy', ['aa', 'bb', '...']]]
[['xx', 'yyy', ['aa', 'bb', ['dd', 'ee'], 'ff', '...']]]
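For comparison, the "state machine or a parser" option mentioned above can also be written by hand with only the standard library. The following is a minimal sketch of a recursive-descent parser, not part of the original answer. The name parse_nested is hypothetical, and it assumes items contain no parentheses or commas and that there is no whitespace around the parentheses. Unlike the pyparsing version, it returns the list without the extra outer wrapper.

```python
def parse_nested(text):
    """Parse a string such as '(a,b,(c,d))' into nested lists of strings."""
    text = text.strip()
    pos = 0  # index of the next unread character

    def parse_list():
        nonlocal pos
        if pos >= len(text) or text[pos] != "(":
            raise ValueError(f"expected '(' at position {pos}")
        pos += 1  # consume '('
        items = []
        while True:
            if pos < len(text) and text[pos] == "(":
                items.append(parse_list())  # nested list: recurse
            else:
                start = pos
                while pos < len(text) and text[pos] not in "(),":
                    pos += 1
                token = text[start:pos].strip()
                if not token:
                    raise ValueError(f"empty item at position {start}")
                items.append(token)
            if pos >= len(text):
                raise ValueError("unbalanced parentheses")
            if text[pos] == ",":
                pos += 1  # consume ',' and read the next item
            elif text[pos] == ")":
                pos += 1  # consume ')' and finish this list
                return items
            else:
                raise ValueError(f"unexpected {text[pos]!r} at position {pos}")

    result = parse_list()
    if pos != len(text):
        raise ValueError("trailing characters after closing parenthesis")
    return result


print(parse_nested("(xx,yyy,(aa,bb,(dd,ee),ff,...))"))
```

Because each nesting level is handled by a recursive call, the depth is limited only by Python's recursion limit.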

Answer 2

If you're allergic to REs, you could use pyparsing:

>>> import pyparsing as p
>>> ope, clo, com = map(p.Suppress, '(),')
>>> w = p.Word(p.alphas)
>>> s = ope + w + com + w + com + ope + p.delimitedList(w) + clo + clo
>>> x = '(xx,yyy,(aa,bb,cc))'
>>> list(s.parseString(x))
['xx', 'yyy', 'aa', 'bb', 'cc']

pyparsing also makes it easy to control the exact form of the results (e.g. by grouping the last 3 items into their own sublist), if you want. But I think the nicest aspect is how naturally (depending on how much space you want to devote to it) you can make the "grammar specification" read: an open paren, a word, a comma, a word, a comma, an open paren, a delimited list of words, two closed parens (if you find the assignment to s above not so easy to read, I guess it's my fault for not choosing longer identifiers ;-).

Answer 3

Let's use regular expressions!

/\(([^,]+),([^,]+),\(([^)]+)\)\)/

Match against that: the first capturing group contains xx, the second contains yyy; split the third on "," and you have your list.
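In Python, that recipe could look roughly like the sketch below. The names PATTERN and extract are hypothetical, and the use of fullmatch (so that the whole input must fit the pattern) is an assumption rather than something the answer specifies.

```python
import re

# Same pattern as above, written as a Python raw string.
PATTERN = re.compile(r"\(([^,]+),([^,]+),\(([^)]+)\)\)")


def extract(text):
    """Return (first, second, items), or None if the text does not match."""
    m = PATTERN.fullmatch(text.strip())
    if m is None:
        return None
    first, second, inner = m.groups()
    return first, second, inner.split(",")  # split the third group on commas


print(extract("(xx,yyy,(aa,bb,cc))"))
```

This only handles a single level of nesting; a deeper inner list would not match.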

Answer 4

How about this?

>>> import ast
>>> import re
>>>
>>> s = "(xx,yyy,(aa,bb,ccc))"
>>> x = re.sub(r"(\w+)", r'"\1"', s)
>>> x
'("xx","yyy",("aa","bb","ccc"))'
>>> ast.literal_eval(x)
('xx', 'yyy', ('aa', 'bb', 'ccc'))

Answer 5

I don't know that this is better, but it's a different way to do it, using a regex as previously suggested:

import re

def processInput(inputStr):
    # Strip all parentheses from each comma-separated piece.
    value = [re.sub(r'\(*\)*', '', i) for i in inputStr.split(',')]
    return value[0], value[1], value[2:]

Alternatively, you could use two chained replace calls in lieu of the regex.
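A minimal sketch of that chained-replace variant, using only str methods. The snake_case name process_input is hypothetical. Like the regex version, it does no validation and flattens any deeper nesting.

```python
def process_input(input_str):
    # Split on commas, then drop every '(' and ')' from each piece.
    parts = [p.replace("(", "").replace(")", "") for p in input_str.split(",")]
    return parts[0], parts[1], parts[2:]


print(process_input("(xx,yyy,(aa,bb,cc))"))
```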

Answer 6

Your solution is decent (simple and efficient). You could use regular expressions to restrict the syntax if you don't trust your data source.

import re

# Two comma-free fields, then a parenthesised list, all wrapped in parentheses.
parser_re = re.compile(r'\(([^,)]+),([^,)]+),\(([^)]+)\)\)$')

def parse(text):
    m = parser_re.match(text)
    if m:
        first = m.group(1)
        second = m.group(2)
        rest = m.group(3).split(",")
        return (first, second, rest)
    else:
        return None

print(parse('(xx,yy,(aa,bb,cc,dd))'))
print(parse('xx,yy,(aa,bb,cc,dd)'))  # doesn't parse, returns None

# can use this to unpack the various parts.
# first,second,rest = parse(...)

Prints:

('xx', 'yy', ['aa', 'bb', 'cc', 'dd'])
None

6 answers

Answer 1
3 2010-07-01 03:34:41

Answer 2
3 (accepted) 2010-07-01 04:17:35

Answer 3
2 2010-07-01 02:32:56

Answer 4
2 2010-07-01 02:47:50

Answer 5
1 2010-07-01 02:36:57

Answer 6
0 2010-07-01 05:12:16
