
Matching a list of elements to a string in no order

I am trying to match multiple elements against a single string, with little to no luck.

The regex should return every element of the token array that occurs in the string, as many times as it occurs and in the order of occurrence. This would serve as a basic lexing algorithm for a very basic C compiler.

Is there a way I could transform my array into a working pattern where the elements are essentially unordered? I have not found any other pattern that works in my case, since the elements of my array can appear anywhere in the string.

file = """
int main() {
    return 2;
}"""

tokens = ['{', '}', '\(', '\)', ';', "int", "return", '[a-zA-Z]\w*', '[0-9]+']

def lex(file):
    results = []
    for i in tokens:
        r = re.match(r".?"+i+".",file)
        if r != None:
            results.append(r.group())
    return r

The output should be something like this:

r = ["int", "main", "(", ")", "{", "return", "2", ";", "}"]

Based on the solution from "What is the Python way of doing a \G anchored parsing loop", you can use:

import re
file = """
int main() {
    return 2;
}"""
 
tokens = ['{','}',r'\(',r'\)',';',"int","return",r'[a-zA-Z]\w*','[0-9]+']
p = re.compile(fr"\s*({'|'.join(tokens)})")
 
def tokenize(w, pattern):
    index = 0
    m = pattern.match(w, index)
    o = []

    while m and index != m.end():
        o.append(m.group(1))
        index = m.end()
        m = pattern.match(w, index)
    return o
 
print(tokenize(file, p))
# => ['int', 'main', '(', ')', '{', 'return', '2', ';', '}']

See the Python demo and the regex demo.

Basically, this matches any of the patterns in the tokens list consecutively, each after zero or more whitespace characters, starting from the start of the string.
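To make the consecutive matching visible, here is a small sketch that records where each token was found. The helper name trace is hypothetical (it is not part of the answer above); it reuses the same tokens list and the same compiled pattern.

```python
import re

tokens = ['{', '}', r'\(', r'\)', ';', 'int', 'return', r'[a-zA-Z]\w*', '[0-9]+']
pattern = re.compile(fr"\s*({'|'.join(tokens)})")

def trace(text):
    """Hypothetical helper: return (start, end, token) for each consecutive match."""
    steps = []
    pos = 0
    while True:
        # match() anchors at pos, so every token must begin right where
        # the previous one ended (after optional whitespace).
        m = pattern.match(text, pos)
        if m is None or m.end() == pos:
            break
        steps.append((m.start(1), m.end(1), m.group(1)))
        pos = m.end()
    return steps

print(trace("int x;"))
# => [(0, 3, 'int'), (4, 5, 'x'), (5, 6, ';')]
```

Each match starts exactly where the previous one ended, which is what makes this behave like a \G anchored loop rather than a free search.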

This also means you must have a complete set of token patterns for everything that might appear in the input; otherwise, the loop stops at the first non-matching text and returns only the tokens found up to that point.
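If silently stopping at unknown text is not what you want, one possible variation is to raise an error instead. This is only a sketch: the name tokenize_strict and the choice of ValueError are assumptions for illustration, not part of the solution above.

```python
import re

tokens = ['{', '}', r'\(', r'\)', ';', 'int', 'return', r'[a-zA-Z]\w*', '[0-9]+']
pattern = re.compile(fr"\s*({'|'.join(tokens)})")

def tokenize_strict(text, pattern):
    """Hypothetical variant of tokenize(): raise on text no token pattern matches."""
    out = []
    pos = 0
    end = len(text.rstrip())  # trailing whitespace is not an error
    while pos < end:
        m = pattern.match(text, pos)
        if m is None or m.end() == pos:
            # Nothing in the token list matches here, so report the position.
            raise ValueError(f"unexpected text at index {pos}: {text[pos:pos + 10]!r}")
        out.append(m.group(1))
        pos = m.end()
    return out

print(tokenize_strict("int main() {\n    return 2;\n}", pattern))
# => ['int', 'main', '(', ')', '{', 'return', '2', ';', '}']
```

With an input such as "int x = 2;", this raises ValueError at the "=" because no pattern in tokens covers it, whereas the original loop would just return ['int', 'x'].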
