
Find prefix and suffix of keyword using pyparsing

I'm trying to parse strings like this: aa bb first item ee ff

I need to separate the prefix 'aa bb', the keyword 'first item', and the suffix 'ee ff'.

The prefix and suffix can each be several words, or may not exist at all. The keyword is one of a list of predefined values.

This is what I tried, but it didn't work:

a = ZeroOrMore(Word(alphas)('prefix')) & oneOf(['first item', 'second item'])('word') & ZeroOrMore(Word(alphas)('suffix'))

The first issue is your use of the '&' operator. In pyparsing, '&' produces Each expressions, which are like Ands but accept the subexpressions in any order:

Word('a') & Word('b') & Word('c')

would match 'aaa bbb ccc', but also 'bbb aaa ccc', 'ccc bbb aaa', etc.

In your parser, you'll want to use the '+' operator, which produces And expressions. Ands match several subexpressions, but only in the given order.
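
The difference between the two operators can be checked directly. The following is a minimal sketch; the helper name `matches` is made up for this illustration, and it assumes the classic camelCase pyparsing API (`parseString`) is available:

```python
from pyparsing import ParseException, Word

# '&' builds an Each: all three parts are required, in any order.
any_order = Word('a') & Word('b') & Word('c')

# '+' builds an And: all three parts are required, in exactly this order.
in_order = Word('a') + Word('b') + Word('c')

def matches(expr, text):
    """Return True if expr parses the whole of text (hypothetical helper)."""
    try:
        expr.parseString(text, parseAll=True)
        return True
    except ParseException:
        return False

for sample in ('aaa bbb ccc', 'bbb aaa ccc'):
    print(sample, matches(any_order, sample), matches(in_order, sample))
```

Only the Each version should accept the reordered input.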

Secondly, one of the reasons for using pyparsing is to accept varying whitespace. Whitespace is an issue for parsers, especially when using str.find or regexes; in regexes, this usually manifests as lots of \s+ fragments throughout your match expressions. In your pyparsing parser, if the input string contains 'first  item' (two spaces between 'first' and 'item'), trying to match the literal string 'first item' will fail. Instead, you should match the words separately, probably using pyparsing's Keyword class, and let pyparsing skip over any whitespace between them. To simplify this, I wrote a short method wordphrase:

def wordphrase(s):
    return And(map(Keyword, s.split())).addParseAction(' '.join)
keywords = wordphrase('first item') | wordphrase('second item')
print(keywords)

prints:

{{"first" "item"} | {"second" "item"}}

indicating that each word will be parsed individually, with any number of spaces between the words.
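
To see the whitespace point in action, here is a small sketch comparing a single Literal against the wordphrase approach on input with extra spaces. The variable names are illustrative, and the helper is written with a list comprehension instead of map, which is assumed to behave the same way:

```python
from pyparsing import And, Keyword, Literal, ParseException

def wordphrase(s):
    # One Keyword per word; the parse action rejoins them with single spaces.
    return And([Keyword(w) for w in s.split()]).addParseAction(' '.join)

literal = Literal('first item')    # must match this exact text, spaces included
phrase = wordphrase('first item')  # matches the two words with any spacing

spaced = 'first   item'            # three spaces between the words

try:
    literal.parseString(spaced)
    literal_matched = True
except ParseException:
    literal_matched = False

phrase_tokens = phrase.parseString(spaced).asList()

print(literal_matched)
print(phrase_tokens)
```

The Literal is expected to fail on the extra spaces, while the wordphrase version returns the normalized 'first item'.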

Lastly, you have to write pyparsing parsers knowing that pyparsing does not do any lookahead. In your parser, the prefix expression ZeroOrMore(Word(alphas)) will match all the words in "aa bb first item ee ff"; then there is nothing left to match the keywords expression, so the parser fails. To code this in pyparsing, the expression inside your ZeroOrMore for the prefix words has to translate to "match every word of alphas, but first make sure we are not about to parse a keyword expression". In pyparsing, this kind of negative lookahead is implemented using NotAny, which you can create using the unary ~ operator. For readability, we'll use the keywords expression from above:

non_keyword = ~keywords + Word(alphas)
a = ZeroOrMore(non_keyword)('prefix') + keywords('word') + ZeroOrMore(Word(alphas))('suffix')
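
The effect of the lookahead can be reproduced by parsing the same input with and without it. This is a sketch only; the names `greedy`, `guarded`, and `words` are invented for the comparison:

```python
from pyparsing import And, Keyword, ParseException, Word, ZeroOrMore, alphas

def wordphrase(s):
    return And([Keyword(w) for w in s.split()]).addParseAction(' '.join)

keywords = wordphrase('first item') | wordphrase('second item')
words = ZeroOrMore(Word(alphas))

# Greedy prefix: swallows every word, including 'first' and 'item',
# so nothing is left for the keywords expression.
greedy = words('prefix') + keywords('word') + words('suffix')

# Guarded prefix: each prefix word is accepted only if a keyword
# phrase does not start at the current position.
guarded = (ZeroOrMore(~keywords + Word(alphas))('prefix')
           + keywords('word') + words('suffix'))

sample = 'aa bb first item ee ff'

try:
    greedy.parseString(sample)
    greedy_ok = True
except ParseException:
    greedy_ok = False

guarded_tokens = guarded.parseString(sample).asList()

print(greedy_ok)
print(guarded_tokens)
```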

Here is a complete parser, with results from running runTests against different sample strings:

from pyparsing import And, Keyword, Word, ZeroOrMore, alphas

def wordphrase(s):
    return And(map(Keyword, s.split())).addParseAction(' '.join)
keywords = wordphrase('first item') | wordphrase('second item')

non_keyword = ~keywords + Word(alphas)
a = ZeroOrMore(non_keyword)('prefix') + keywords('word') + ZeroOrMore(Word(alphas))('suffix')

text = """
    # prefix and suffix
    aa bb first item ee ff

    # suffix only
    first item ee ff

    # prefix only
    aa bb first item

    # no prefix or suffix
    first item

    # multiple spaces in item, replaced with single spaces by parse action
    first   item
    """

a.runTests(text)

Gives:

# prefix and suffix
aa bb first item ee ff
['aa', 'bb', 'first item', 'ee', 'ff']
- prefix: ['aa', 'bb']
- suffix: ['ee', 'ff']
- word: 'first item'

# suffix only
first item ee ff
['first item', 'ee', 'ff']
- suffix: ['ee', 'ff']
- word: 'first item'

# prefix only
aa bb first item
['aa', 'bb', 'first item']
- prefix: ['aa', 'bb']
- word: 'first item'

# no prefix or suffix
first item
['first item']
- word: 'first item'

# multiple spaces in item, replaced with single spaces by parse action
first   item
['first item']
- word: 'first item'

If I understood your question correctly, this should do the trick:

toParse='aa bb first item ee ff'
keywords=['test 1','first item','test two']
for x in keywords:
    res=toParse.find(x)
    if res>=0:
        print('prefix='+toParse[0:res])
        print('keyword='+x)
        print('suffix='+toParse[res+len(x)+1:])
        break

Gives this result:

prefix=aa bb 
keyword=first item
suffix=ee ff
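
Note that str.find matches the keyword text exactly, so it will miss 'first   item' written with extra spaces and can also match inside longer words. As a rough illustration of the \s+ approach mentioned in the other answer, here is a standard-library sketch. It is not taken from either answer, and the function name `split_on_keyword` is made up; unlike the loop above, it returns the leftmost keyword found in the text, not the first one in the list.

```python
import re

def split_on_keyword(text, keywords):
    """Return (prefix, keyword, suffix), or None if no keyword is found."""
    # Spaces inside each keyword become \s+ so any amount of whitespace
    # matches; \b stops 'first item' from matching inside 'first items'.
    alternatives = '|'.join(
        r'\s+'.join(re.escape(word) for word in kw.split())
        for kw in keywords
    )
    pattern = re.compile(r'\b(?:' + alternatives + r')\b')
    m = pattern.search(text)
    if m is None:
        return None
    prefix = text[:m.start()].strip()
    keyword = ' '.join(m.group(0).split())  # normalize to single spaces
    suffix = text[m.end():].strip()
    return prefix, keyword, suffix

keywords = ['test 1', 'first item', 'test two']
print(split_on_keyword('aa bb first   item ee ff', keywords))
```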
