My negative lookahead is not working - why?

Question

I have a text scattered with various strings, dates, tab characters and language codes. I want to extract the strings that follow a date+tab combination and are followed by a language code like '[en]' and a tab character, but only when that tab is not followed by the string "BAD THINGS" (e.g. "2020-01-12\tSTRING WE NEED[en]\tGOOD THINGS", as opposed to "2020-01-12\tSTRING WE DON'T NEED[en]\tBAD THINGS").

Here is a short example text of what I'm working with:

\n2021-01-12\tThis string is not needed [it]\tBad things\tBad things\n2021-01-12\tThis string is also not needed [en]\tBad things\tBad things\n2021-01-11\tString 1 that is needed. [it]\tString 1 that is needed. is repeated here\tNot interesting here\n2021-01-11\tString 2 that is needed [fr]\tString 2 that is needed is repeated here\tUnnecessary string\n2021-01-11\tString 3 that is needed... [ru]\tString 3 that is needed... is repeated here\tAnother part we're not interested in

I made this regex to capture all strings between a date and a language code:

(\d{4}-\d{2}-\d{2}\\t)(.*?)(\[\w{2}\]\\t)

This works fine (see here). However, when I add a negative lookahead to exclude the strings followed by "Bad things", my whole regex goes south:

(\d{4}-\d{2}-\d{2}\\t)(.*?)(\[\w{2}\]\\t)(?!Bad things)

You can see the result here. I understand my lookahead somehow makes the regex greedy, but I have no idea how to avoid this; adding a ? after it doesn't work. Can you help me out here?
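For readers without the linked demo, the behaviour can be reproduced with a short Python sketch. It assumes the sample contains literal two-character `\t` and `\n` sequences (which is what `\\t` in the pattern matches) and uses Python's `re` module, which the question does not name. The lookahead does not really make `.*?` greedy: a lazy quantifier starts with the shortest candidate, but when the lookahead rejects that candidate it keeps extending until the rest of the pattern succeeds.

```python
import re

# Sample from the question. Raw strings keep "\t" and "\n" as literal
# backslash sequences, which is what "\\t" in the pattern matches.
TEXT = (
    r"\n2021-01-12\tThis string is not needed [it]\tBad things\tBad things"
    r"\n2021-01-12\tThis string is also not needed [en]\tBad things\tBad things"
    r"\n2021-01-11\tString 1 that is needed. [it]\tString 1 that is needed. is repeated here\tNot interesting here"
    r"\n2021-01-11\tString 2 that is needed [fr]\tString 2 that is needed is repeated here\tUnnecessary string"
    r"\n2021-01-11\tString 3 that is needed... [ru]\tString 3 that is needed... is repeated here\tAnother part we're not interested in"
)

# The pattern from the question, with the negative lookahead appended.
WITH_LOOKAHEAD = re.compile(
    r"(\d{4}-\d{2}-\d{2}\\t)(.*?)(\[\w{2}\]\\t)(?!Bad things)"
)

matches = [m.group(2) for m in WITH_LOOKAHEAD.finditer(TEXT)]

# Three matches instead of the five found without the lookahead.
print(len(matches))
# The first match starts at the first date and runs across both unwanted
# records, stopping only at the first language code that is not followed
# by "Bad things".
print(matches[0])
```

The first captured string therefore swallows the two unwanted records plus "String 1 that is needed. " in a single match, which is the "greedy" effect described above.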

Answer 1

Not sure if this will cover all the cases, but this seems to work:

(\d{4}-\d{2}-\d{2}\\t)([^][]*)(\[\w{2}\]\\t)(?!Bad things)

Demo here.

Explanation:

(\d{4}-\d{2}-\d{2}\\t)   date and tab
([^][]*)                 any run of characters other than `[` and `]`
(\[\w{2}\]\\t)           the language tag, e.g. [en], followed by a tab
(?!Bad things)           negative lookahead: the tab must not be followed by "Bad things"
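A minimal Python sketch of this pattern run against the sample text from the question follows. It assumes the tabs and newlines are literal `\t` and `\n` sequences, as the `\\t` in the pattern implies, and uses Python's `re` module, which the answer does not name. Because `[^][]*` cannot step over a `[`, a rejected record can no longer be absorbed into a later match, so the lookahead simply discards it.

```python
import re

# Sample from the question. Raw strings keep "\t" and "\n" as literal
# backslash sequences, which is what "\\t" in the pattern matches.
TEXT = (
    r"\n2021-01-12\tThis string is not needed [it]\tBad things\tBad things"
    r"\n2021-01-12\tThis string is also not needed [en]\tBad things\tBad things"
    r"\n2021-01-11\tString 1 that is needed. [it]\tString 1 that is needed. is repeated here\tNot interesting here"
    r"\n2021-01-11\tString 2 that is needed [fr]\tString 2 that is needed is repeated here\tUnnecessary string"
    r"\n2021-01-11\tString 3 that is needed... [ru]\tString 3 that is needed... is repeated here\tAnother part we're not interested in"
)

# The pattern from the answer: the middle group may not contain brackets.
FIXED = re.compile(
    r"(\d{4}-\d{2}-\d{2}\\t)([^][]*)(\[\w{2}\]\\t)(?!Bad things)"
)

# Group 2 is the wanted string; strip() drops the space before the tag.
needed = [m.group(2).strip() for m in FIXED.finditer(TEXT)]
print(needed)
# → ['String 1 that is needed.', 'String 2 that is needed', 'String 3 that is needed...']
```

One limitation of this approach: a wanted string that itself contains `[` or `]` would not be matched, which is probably one of the cases the caveat at the top of the answer has in mind.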

Answer 1: score 3, accepted, 2021-01-12 17:06:04
