
Python and RegEx: Repeating Pattern Not Working

I have a text file written in a proprietary programming language, and I want to extract the relevant information about the various function calls.

The structure of a function is:

function name(input1, input2) returns (output1, output2) function body

I'm using Python and RegEx to capture this information, but I've hit a snag. I can capture the name, the inputs and the outputs, but I am unable to grab all of the function body.

I use the following line to capture this info:

re.findall("(function)(.*?)\((.*?)\) returns \((.*?)\)(.*)", file_contents)

However, this fails after the first instance of the word 'function'. Because the function body contains nested statements, I cannot rely on a particular keyword to mark the end of the last group (the function body). I've tried different approaches, and none of them grabs the entire body.

How can I group everything after a particular point and then repeat the pattern?

What I want: 'function', 'name', 'input1, input2', 'output1, output2', 'function body', repeated indefinitely. I want the last group to grab everything after the outputs, and then the pattern to restart when it reaches the next occurrence of the word 'function'. I've tried different variations of the (.*?) and (.*) quantifiers, but I can't seem to get it right.

I am not a programmer by trade, so I am not that adept with RegEx or Python. I know just enough to do the very basics.

This will grab each function up to the start of the next function.
There are 5 capture groups.

If you use findall, each match comes back as a tuple of the 5 groups.

(?s)(\bfunction\b)(.*?)\((.*?)\)\s+returns\s+\((.*?)\)((?:(?!\bfunction\b).)*)

https://regex101.com/r/PkfofA/1

Expanded

 (?s)
 ( \b function \b )            # (1)
 ( .*? )                       # (2)
 \( 
 ( .*? )                       # (3)
 \) \s+ returns \s+ \( 
 ( .*? )                       # (4)
 \) 
 (                             # (5 start)
      (?:
           (?! \b function \b )
           . 
      )*
 )                             # (5 end)
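
The expanded form can be used almost verbatim in Python. The following is a minimal sketch, assuming the pattern is compiled with re.VERBOSE (so whitespace and comments inside it are ignored) and re.DOTALL (in place of the inline (?s)). The name FUNCTION_RE and the sample text are made up for illustration and do not come from the question.

```python
import re

# Same pattern as the expanded form above. re.VERBOSE ignores the
# whitespace and comments; re.DOTALL replaces the inline (?s) flag.
FUNCTION_RE = re.compile(
    r"""
    ( \b function \b )        # (1) keyword
    ( .*? )                   # (2) name
    \(
    ( .*? )                   # (3) inputs
    \) \s+ returns \s+ \(
    ( .*? )                   # (4) outputs
    \)
    (                         # (5) body: consume one character at a
        (?:                   #     time, as long as the next word is
            (?! \b function \b )   # not the keyword again
            .
        )*
    )
    """,
    re.VERBOSE | re.DOTALL,
)

# Hypothetical sample input, not taken from the original question.
sample = (
    "function add(a, b) returns (c)\n"
    "begin c = a + b; end\n"
    "function neg(x) returns (y)\n"
    "begin y = -x; end\n"
)

matches = FUNCTION_RE.findall(sample)
for keyword, name, inputs, outputs, body in matches:
    print(name.strip(), "|", inputs, "|", outputs, "|", body.strip())
```

Note that the body stops at any standalone word 'function', so this approach assumes the keyword never appears inside a function body (for example in a comment).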

I guess finditer() is a way to get better control over each set of 5 groups:

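import re

# target is the text to search, e.g. the contents of the file
pattern = r"(?s)(\bfunction\b)(.*?)\((.*?)\)\s+returns\s+\((.*?)\)((?:(?!\bfunction\b).)*)"
for result in re.finditer(pattern, target):
    g1 = result.group(1)  # the keyword 'function'
    g2 = result.group(2)  # name
    g3 = result.group(3)  # inputs
    g4 = result.group(4)  # outputs
    g5 = result.group(5)  # body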
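
If numbered groups become hard to track, one possible variation is to name them and collect each match into a dict. This is only a sketch: the group names, the helper parse_functions and the sample string are assumptions made for illustration, and the keyword itself is left uncaptured because it is always 'function'.

```python
import re

# Same structure as the pattern above, with named groups.
PATTERN = re.compile(
    r"(?s)\bfunction\b(?P<name>.*?)\((?P<inputs>.*?)\)\s+returns\s+"
    r"\((?P<outputs>.*?)\)(?P<body>(?:(?!\bfunction\b).)*)"
)


def parse_functions(text):
    """Return one dict per function found in text, with whitespace trimmed."""
    return [
        {key: value.strip() for key, value in match.groupdict().items()}
        for match in PATTERN.finditer(text)
    ]


# Hypothetical input, not taken from the original question.
target = "function f(a) returns (b) b = a; function g(c; d) returns (e) e = c;"
functions = parse_functions(target)
for fn in functions:
    print(fn["name"], "->", fn["body"])
```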

Based on further information from the comments, I tested the following regex with re.findall in Python 3.6; it works with the example:

import re

file_contents = "function func1(in1 : bool; in2 : bool; in3 : bool) returns ( out : bool) var L1 : bool; L2 : bool; L5 : bool; L4 : bool; L3 : bool; begin L1 = L3 and L4; L2 = L1 or L5; out = L2; L5 = in3; L4 = in2; L3 = in1; end \n random code \nfunction func2(in1 : bool; in2 : bool; in3 : bool) returns ( out : bool) var L1 : bool; L2 : bool; L5 : bool; L4 : bool; L3 : bool; begin L1 = L3 and L4; L2 = L1 or L5; out = L2; L5 = in3; L4 = in2; L3 = in1;"

pattern = r"(function) (.*?)\((.*?)\) returns \((.*?)\) (.*)"
regex_results = re.findall( pattern, file_contents )

print( regex_results )

Output:

[('function', 'func1', 'in1 : bool; in2 : bool; in3 : bool', ' out : bool', 'var L1 : bool; L2 : bool; L5 : bool; L4 : bool; L3 : bool; begin L1 = L3 and L4; L2 = L1 or L5; out = L2; L5 = in3; L4 = in2; L3 = in1; end '), ('function', 'func2', 'in1 : bool; in2 : bool; in3 : bool', ' out : bool', 'var L1 : bool; L2 : bool; L5 : bool; L4 : bool; L3 : bool; begin L1 = L3 and L4; L2 = L1 or L5; out = L2; L5 = in3; L4 = in2; L3 = in1;')]
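
One caveat worth checking against your own file: this pattern has no (?s) flag, so . does not match a newline and the last group ends at the end of the line that holds the function header. That suits the example, where each function sits on a single line, but a body that spans several lines would be cut short. A small check, using made-up input:

```python
import re

pattern = r"(function) (.*?)\((.*?)\) returns \((.*?)\) (.*)"

# Hypothetical input: the body continues on a second line.
multi_line = "function f(a : bool) returns (b : bool) begin\n b = a; end"

body = re.findall(pattern, multi_line)[0][4]
print(repr(body))  # 'begin' (everything after the newline is lost)
```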

I figured out a different way to accomplish what I was trying to do.

I used the following line:

re.split('(function )(.*?)\\((.*?)\\) returns \\((.*?)\\)', contents)

This splits the text into a list that contains the pieces I wanted. I then chunk the list and assign the pieces to the variables I have.
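
The chunking step could look something like the sketch below. It relies on the documented behaviour that re.split also returns the text of the capture groups, so after the leading text the list repeats in groups of five: keyword, name, inputs, outputs, body. The sample string and the variable names are assumptions for illustration.

```python
import re

# Hypothetical input, not taken from the original file.
contents = (
    "header text "
    "function f(a) returns (b) b = a; "
    "function g(c) returns (d) d = c;"
)

parts = re.split(r"(function )(.*?)\((.*?)\) returns \((.*?)\)", contents)

# parts[0] is whatever precedes the first function. After that the list
# repeats: keyword, name, inputs, outputs, body.
fields = parts[1:]
chunks = [fields[i:i + 5] for i in range(0, len(fields), 5)]

for keyword, name, inputs, outputs, body in chunks:
    print(name, "|", inputs, "|", outputs, "|", body.strip())
```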

Thanks to everyone who took the time to answer.
