Extract parts of a string in Python

I have to parse an input string in Python and extract certain parts from it.

The format of the string is

(xx,yyy,(aa,bb,...)) // The inner parentheses can hold one or more items

I want a function that returns xx, yyy, and a list containing aa, bb, and so on.

I can of course do it by splitting on the parentheses and so on, but I want to know if there is a proper Pythonic way of extracting such information from a string.

I have this code, which works, but is there a better way to do it (without regex)?

def processInput(inputStr):
    value = inputStr.strip()[1:-1]
    parts = value.split(',', 2)
    return parts[0], parts[1], (parts[2].strip()[1:-1]).split(',')

If your parenthesis nesting can be arbitrarily deep, then regexes won't do; you'll need a state machine or a parser. Pyparsing supports recursive grammars through its forward-declaration class, Forward:

from pyparsing import Forward, Group, Suppress, Word, alphas, delimitedList

# Punctuation is matched but suppressed from the results.
LPAR, RPAR, COMMA = map(Suppress, "(),")
# Forward-declare the grammar so it can refer to itself.
nestedParens = Forward()
listword = Word(alphas) | '...'
nestedParens <<= Group(LPAR + delimitedList(listword | nestedParens) + RPAR)

text = "(xx,yyy,(aa,bb,...))"
results = nestedParens.parseString(text).asList()
print(results)

text = "(xx,yyy,(aa,bb,(dd,ee),ff,...))"
results = nestedParens.parseString(text).asList()
print(results)

Prints:

[['xx', 'yyy', ['aa', 'bb', '...']]]
[['xx', 'yyy', ['aa', 'bb', ['dd', 'ee'], 'ff', '...']]]
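For comparison, the "state machine or parser" option mentioned above can also be written by hand with only the standard library. The following is a minimal sketch of a recursive-descent parser, not part of the original answer; the name parse_nested is hypothetical, and it assumes items never contain commas or parentheses. Unlike the pyparsing version, it returns the outer group directly rather than wrapped in an extra list.

```python
def parse_nested(text):
    """Parse '(a,b,(c,d))' into nested lists (hypothetical helper)."""
    text = text.strip()
    pos = 0  # index of the next unread character

    def parse_group():
        nonlocal pos
        if pos >= len(text) or text[pos] != '(':
            raise ValueError(f"expected '(' at position {pos}")
        pos += 1  # consume '('
        items = []
        while True:
            if pos < len(text) and text[pos] == '(':
                items.append(parse_group())  # recurse into a nested group
            else:
                start = pos
                while pos < len(text) and text[pos] not in ',()':
                    pos += 1
                items.append(text[start:pos].strip())
            if pos >= len(text):
                raise ValueError("unbalanced parentheses")
            if text[pos] == ',':
                pos += 1  # consume ',' and read the next item
            elif text[pos] == ')':
                pos += 1  # consume ')' and close this group
                return items
            else:
                raise ValueError(f"unexpected {text[pos]!r} at position {pos}")

    result = parse_group()
    if pos != len(text):
        raise ValueError("trailing characters after the closing parenthesis")
    return result


print(parse_nested("(xx,yyy,(aa,bb,(dd,ee),ff,...))"))
```

Because each nested group is handled by a recursive call, the nesting depth is limited only by Python's recursion limit.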

If you're allergic to REs, you could use pyparsing:

>>> import pyparsing as p
>>> ope, clo, com = map(p.Suppress, '(),')
>>> w = p.Word(p.alphas)
>>> s = ope + w + com + w + com + ope + p.delimitedList(w) + clo + clo
>>> x = '(xx,yyy,(aa,bb,cc))'
>>> list(s.parseString(x))
['xx', 'yyy', 'aa', 'bb', 'cc']

pyparsing also makes it easy to control the exact form of the results (e.g. by grouping the last three items into their own sublist), if you want. But I think the nicest aspect is how naturally you can make the "grammar specification" read (depending on how much space you want to devote to it): an open paren, a word, a comma, a word, a comma, an open paren, a delimited list of words, two closed parentheses. (If you find the assignment to s above not so easy to read, I guess it's my fault for not choosing longer identifiers ;-)

Let's use regular expressions!

/\(([^,]+),([^,]+),\(([^)]+)\)\)/

Match against that: the first capturing group contains xx, the second contains yyy, and the third contains the inner items. Split the third on "," and you have your list.
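In Python, that recipe might look like the sketch below. The function name extract and the choice to raise ValueError on a mismatch are assumptions for illustration, not part of the answer; the pattern is the one given above, written as a raw string.

```python
import re

# Same pattern as above: two comma-free fields, then a parenthesized list.
PATTERN = re.compile(r'\(([^,]+),([^,]+),\(([^)]+)\)\)')


def extract(text):
    """Return (first, second, items) or raise ValueError (hypothetical helper)."""
    m = PATTERN.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"input does not match the expected format: {text!r}")
    first, second, inner = m.groups()
    # The third group holds the inner items; split it on commas.
    return first, second, inner.split(',')


print(extract("(xx,yyy,(aa,bb,cc))"))
```

Using fullmatch rather than match rejects input with trailing characters after the final parenthesis.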

How about this?

>>> import ast
>>> import re
>>>
>>> s="(xx,yyy,(aa,bb,ccc))"
>>> x=re.sub(r"(\w+)",r'"\1"',s)
>>> x
'("xx","yyy",("aa","bb","ccc"))'
>>> ast.literal_eval(x)
('xx', 'yyy', ('aa', 'bb', 'ccc'))
>>>
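To get the shape the question asks for (two strings and a list), the same idea can be wrapped in a function. This is only a sketch: the name parse_with_literal_eval is hypothetical, and it assumes every item consists solely of word characters, since that is all the \w+ substitution quotes.

```python
import ast
import re


def parse_with_literal_eval(text):
    """Quote each word, then let ast.literal_eval build the tuples."""
    quoted = re.sub(r'(\w+)', r'"\1"', text)
    first, second, rest = ast.literal_eval(quoted)
    # A one-item inner group such as ("aa") evaluates to a plain string,
    # not a tuple, so wrap it before converting to a list.
    if isinstance(rest, str):
        rest = (rest,)
    return first, second, list(rest)


print(parse_with_literal_eval("(xx,yyy,(aa,bb,ccc))"))
```

ast.literal_eval only evaluates literals, so unlike eval it does not execute arbitrary code from the input string.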

I don't know that this is better, but it's a different way to do it. It uses a regex, as previously suggested:

import re

def processInput(inputStr):
    # Split on commas, then strip any parentheses from each piece.
    value = [re.sub(r'\(*\)*', '', i) for i in inputStr.split(',')]
    return value[0], value[1], value[2:]

Alternatively, you could use two chained replace calls in lieu of the regex.
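A minimal sketch of that variant follows. The name process_input is an assumption, and because the approach flattens the string it only suits the single level of nesting shown in the question.

```python
def process_input(input_str):
    """Drop all parentheses with chained str.replace, then split on commas."""
    cleaned = input_str.replace('(', '').replace(')', '')
    parts = [p.strip() for p in cleaned.split(',')]
    # The first two fields stand alone; everything after them is the list.
    return parts[0], parts[1], parts[2:]


print(process_input("(xx,yyy,(aa,bb,cc))"))
```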

Your solution is decent (simple, efficient). You could use regular expressions to restrict the syntax if you don't trust your data source.

import re

# Outer parens, two comma-free fields, then a parenthesized comma list.
parser_re = re.compile(r'\(([^,)]+),([^,)]+),\(([^)]+)\)\)')

def parse(text):
    m = parser_re.match(text)
    if m:
        first = m.group(1)
        second = m.group(2)
        rest = m.group(3).split(",")
        return (first, second, rest)
    else:
        return None

print(parse('(xx,yy,(aa,bb,cc,dd))'))
print(parse('xx,yy,(aa,bb,cc,dd)'))  # doesn't parse, returns None

# can use this to unpack the various parts.
# first,second,rest = parse(...)

Prints:

('xx', 'yy', ['aa', 'bb', 'cc', 'dd'])
None
