
Python Regex to find everything within parentheses, with a prefix beforehand

This seems like a fairly simple issue, but I can't get it to work.

I have a text file that contains JSON-like data, but a couple of additional lines stop it from being valid JSON, and I need to remove them. This sounds very simple, even more so because the valid JSON strings (which I can parse later) are always contained in the following container:

xyz()

So, for example, the dataset will be something like:

abcdefg
xyz({"id_value": 123, "text_value": "efg"})

abcdefg
xyz({"id_value": 124, "text_value": "hij"})

Each separate JSON string is always prefixed by abcdefg and then xyz(, and there is always a closing bracket after it. So the format is consistent.

I was trying the following:

re.findall(r'xyz\(.*?\)', text_file)

However, despite attempting variations of this (e.g. using re.search, trying \w+ etc.), nothing seems to work (by which I mean it returns an empty list).

If I just try the following:

re.findall(r'xyz\(', text_file)

Then it returns:

['xyz(', 'xyz(']

As expected.

So the issue appears to be with the string in the brackets, but I cannot work out what the problem is, as other examples on here suggest my code is correct (which it can't be, as it doesn't work)!

I presume it's something horrifically simple, but I'm a bit stuck!

You can install the PyPI regex module by running pip install regex (or pip3 install regex) and then use this library to match strings between xyz( and the next paired ) char using:

import regex
#...
output = [x.group() for x in regex.finditer(r'xyz(\((?:[^()]++|(?1))*\))', text_file)]

The list comprehension is used to avoid the issue with regex.findall, which returns only the captured substrings when a capturing group is defined in the regex (and here, the capturing group around the parentheses is required, since it is recursed inside the pattern with a (?1) subroutine).
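To illustrate the findall behaviour described above, here is a minimal sketch using the standard-library re module, which treats capturing groups in findall the same way. The sample string and the simplified, non-recursive pattern are assumptions made for the demonstration; they are not the recursive pattern from the answer.

```python
import re

# Hypothetical sample in the format described in the question.
sample = 'abcdefg\nxyz({"id_value": 123, "text_value": "efg"})\n'

# Simplified pattern with one capturing group (no recursion; stdlib re does not support it).
pattern = r'xyz(\(.*?\))'

# findall returns only what the capturing group matched, dropping the "xyz" prefix.
captured_only = re.findall(pattern, sample)

# finditer yields match objects; .group() gives the whole match, prefix included.
whole_matches = [m.group() for m in re.finditer(pattern, sample)]

print(captured_only)
print(whole_matches)
```

With the group in place, findall drops the xyz prefix, while the finditer comprehension keeps the full match.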

Pattern details:

  • xyz - xyz text
  • (\((?:[^()]++|(?1))*\)) - Group 1:
    • \( - a ( char
    • (?:[^()]++|(?1))* - zero or more repetitions of either one or more chars other than ( and ), or the subroutine that recurses the whole Group 1 pattern
    • \) - a ) char.
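If installing the regex module is not an option, the balanced-parentheses matching that the pattern performs can be approximated in plain Python with a depth counter. This is a different technique from the answer's recursive regex, offered only as a rough sketch: the helper name find_balanced and the sample text are hypothetical, and, like the pattern above, it counts parentheses that appear inside JSON string values, so an unpaired ( or ) inside a string would throw it off.

```python
import json

def find_balanced(text, prefix="xyz"):
    """Return every 'prefix(...)' substring whose parentheses are balanced."""
    opener = prefix + "("
    results = []
    pos = 0
    while True:
        start = text.find(opener, pos)
        if start == -1:
            break
        depth = 0
        end = None
        # Scan from the opening parenthesis, tracking nesting depth.
        for i in range(start + len(prefix), len(text)):
            if text[i] == "(":
                depth += 1
            elif text[i] == ")":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        if end is None:
            pos = start + 1  # no balanced close here; try the next opener
        else:
            results.append(text[start:end])
            pos = end
    return results

# Hypothetical sample; the first record has nested parentheses in a value.
sample = (
    'abcdefg\nxyz({"id_value": 123, "text_value": "e(f)g"})\n\n'
    'abcdefg\nxyz({"id_value": 124, "text_value": "hij"})\n'
)
matches = find_balanced(sample)

# Strip the 'xyz(' prefix and the trailing ')' to get the JSON payloads.
payloads = [json.loads(m[len("xyz("):-1]) for m in matches]
print(payloads)
```

The depth counter plays the role of the (?1) recursion: each ( goes one level deeper, and the match ends only when the level returns to zero.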

