Python Regex to find everything within parentheses, with a prefix beforehand
This seems like a fairly simple issue, but I can't get it to work.
I have a text file that contains JSON-like data, but a couple of additional lines stop it from being valid JSON, and I need to remove them. This should be even simpler because the valid JSON strings (which I can parse later) are always contained in the following container:
xyz()
So, for example, the dataset will be something like:
abcdefg
xyz({"id_value": 123, "text_value": "efg"})
abcdefg
xyz({"id_value": 124, "text_value": "hij"})
Each separate JSON string is always preceded by an abcdefg line and then xyz(, and there is always a closing bracket after it, so the format is consistent.
I was trying the following:
re.findall(r'xyz\(.*?\)', text_file)
However, despite attempting variations of this (e.g. using re.search, trying \w+, etc.), nothing seems to work (by which I mean it returns an empty list).
If I just try the following:
re.findall(r'xyz\(', text_file)
Then it returns:
['xyz(', 'xyz(']
As expected.
So the issue appears to be with the string inside the brackets, but I cannot work out what the problem is, as other examples on here suggest my code is correct (which it can't be, as it doesn't work)!
I presume it's something horrifically simple, but I'm a bit stuck!
You can install the PyPI regex module by running pip install regex (or pip3 install regex) and then use this library to match strings between xyz( and the next paired ) char using:
import regex
#...
output = [x.group() for x in regex.finditer(r'xyz(\((?:[^()]++|(?1))*\))', text_file)]
The list comprehension is used to avoid the issue with regex.findall, which returns only the captured substrings when a capturing group is defined in the regex (and here, the capturing group around the parentheses is required, since it is recursed inside the pattern with a (?1) subroutine).
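The findall behaviour described here can be seen with the standard re module as well, which follows the same rule. The sketch below is only an illustration: it uses a simplified, non-recursive pattern (an assumption made so the example runs without installing anything), and the variable names groups_only and whole_matches are hypothetical.

```python
import re

# Sample data copied from the question.
text_file = (
    'abcdefg\n'
    'xyz({"id_value": 123, "text_value": "efg"})\n'
    'abcdefg\n'
    'xyz({"id_value": 124, "text_value": "hij"})\n'
)

# Simplified pattern with one capturing group (no recursion; illustration only).
pattern = r'xyz(\(.*?\))'

# findall returns only what the capturing group matched, so the xyz prefix is lost.
groups_only = re.findall(pattern, text_file)

# finditer yields match objects; group() with no argument is the whole match.
whole_matches = [m.group() for m in re.finditer(pattern, text_file)]

print(groups_only)
print(whole_matches)
```

If only the part inside xyz( ... ) is wanted, the findall result may already be close to what is needed; the finditer form is the one that keeps the full xyz(...) text.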
Pattern details:
xyz - the xyz text
(\((?:[^()]++|(?1))*\)) - Group 1:
  \( - a ( char
  (?:[^()]++|(?1))* - zero or more repetitions of either one or more chars other than ( and ), or the (?1) subroutine, which recurses the whole Group 1 pattern
  \) - a ) char.
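For readers who cannot install the regex module, the same "paired parenthesis" idea can be sketched in plain Python with a depth counter. This is not the pattern from the answer but a swapped-in technique that mirrors its balancing logic. The function name extract_wrapped and the sample text (including the nested parentheses and the record split over two lines) are assumptions made for illustration.

```python
import json

def extract_wrapped(text, prefix="xyz("):
    """Return the text between each prefix and its paired closing parenthesis.

    Assumes the prefix ends with "(" and that parentheses are balanced.
    """
    results = []
    pos = 0
    while True:
        start = text.find(prefix, pos)
        if start == -1:
            break
        body_start = start + len(prefix)
        depth = 1  # the "(" at the end of the prefix is already open
        i = body_start
        while i < len(text) and depth > 0:
            if text[i] == "(":
                depth += 1
            elif text[i] == ")":
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced input: stop rather than guess
        results.append(text[body_start:i - 1])  # drop the closing ")"
        pos = i
    return results

# Hypothetical sample: one value contains parentheses, one record spans two lines.
sample = (
    'abcdefg\n'
    'xyz({"id_value": 123, "text_value": "efg (nested)"})\n'
    'abcdefg\n'
    'xyz({"id_value": 124,\n "text_value": "hij"})\n'
)

records = [json.loads(chunk) for chunk in extract_wrapped(sample)]
print(records)
```

Like the recursive pattern, this counts every ( and ), so an unbalanced parenthesis inside a JSON string value would throw it off. Handling that case would need a scanner that is aware of string quoting.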