
Creating a List Lexer/Parser

I need to create a lexer/parser that handles input data of variable length and structure.

Say I have a list of reserved keywords:

keyWordList = ['command1', 'command2', 'command3']

and a user input string:

userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
userInputList = userInput.split()

How would I go about writing this function?

INPUT:

tokenize(userInputList, keyWordList)

OUTPUT:
[['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']

I've written a tokenizer that can identify keywords, but I have been unable to figure out an efficient way to embed groups of non-keywords into lists that are one level deeper.

Regex solutions are welcome, but I would really like to see the underlying algorithm, as I will probably extend the application to lists of other objects and not just strings.

Something like this:

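def tokenize(lst, keywords):
    cur = []  # non-keyword items collected since the last keyword
    for x in lst:
        if x in keywords:
            yield cur  # emit the group that preceded this keyword
            yield x
            cur = []
        else:
            cur.append(x)
    if cur:
        yield cur  # emit any trailing group after the last keyword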

This returns a generator, so wrap the call in list() if you need a list.

That is easy to do with a regex:

>>> import re
>>> reg = r'(.+?)\s(%s)(?:\s|$)' % '|'.join(keyWordList)
>>> userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
>>> re.findall(reg, userInput)
[('The quick brown', 'command1'), ('fox jumped over', 'command2'), ('the lazy dog', 'command3')]

Now you just have to split the first element of each tuple.
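The remaining split step could look roughly like the sketch below. The names `pattern` and `tokens` are only illustrative, and `re.escape` is an addition here (not in the snippet above) to guard against keywords that contain regex metacharacters.

```python
import re

keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'

# Same pattern as above, with each keyword escaped (an added precaution).
pattern = r'(.+?)\s(%s)(?:\s|$)' % '|'.join(re.escape(k) for k in keyWordList)

tokens = []
for words, keyword in re.findall(pattern, userInput):
    tokens.append(words.split())  # group of non-keywords, one level deeper
    tokens.append(keyword)        # the keyword itself stays at the top level

print(tokens)
```

Note that with this pattern, any words after the last keyword are not captured, and a keyword at the very start of the input has no preceding group to match, so this only suits input shaped like the example.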

For more than one level of nesting, a regex may not be a good fit.

There are some nice parsers to choose from on this page: http://wiki.python.org/moin/LanguageParsing

I think Lepl is a good one.

Try this:

keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
inputList = userInput.split()

def tokenize(userInputList, keyWordList):
    keywords = set(keyWordList)
    tokens, acc = [], []
    for e in userInputList:
        if e in keywords:
            tokens.append(acc)
            tokens.append(e)
            acc = []
        else:
            acc.append(e)
    if acc:
        tokens.append(acc)
    return tokens

tokenize(inputList, keyWordList)
> [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
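Since the question mentions extending this to lists of other objects, one possible variation is to pass a predicate instead of a keyword list and let `itertools.groupby` find the runs. This is only a sketch: `tokenize_by` and `is_keyword` are hypothetical names, and unlike the function above it never emits an empty list when two keywords are adjacent or when the input starts with a keyword.

```python
from itertools import groupby

def tokenize_by(items, is_keyword):
    """Group runs of non-keyword items into sublists; keep keywords flat."""
    tokens = []
    # bool() keeps the grouping key strictly True/False even if the
    # predicate returns other truthy or falsy values.
    for matched, run in groupby(items, key=lambda x: bool(is_keyword(x))):
        if matched:
            tokens.extend(run)        # each keyword is its own top-level token
        else:
            tokens.append(list(run))  # consecutive non-keywords form one group
    return tokens

keywords = {'command1', 'command2', 'command3'}
words = 'The quick brown command1 fox jumped over command2 the lazy dog command3'.split()
result = tokenize_by(words, lambda w: w in keywords)

# The same function on non-string items: negative numbers act as "keywords".
numbers = tokenize_by([1, 2, -1, 3, -2, -3, 4], lambda n: n < 0)
```

Here `result` matches the output shown above, and `numbers` comes out as `[[1, 2], -1, [3], -2, -3, [4]]`.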

Or have a look at PyParsing, quite a nice little lexer/parser combination.
