My negative lookahead is not working - why?

I have a text scattered with various strings, dates, tab characters and language codes. I want to extract the strings that follow a date+tab combination and are themselves followed by a language code like '[en]' and a tab character, but only when what comes next is not the string "BAD THINGS" (e.g. "2020-01-12\tSTRING WE NEED[en]\tGOOD THINGS", as opposed to "2020-01-12\tSTRING WE DON'T NEED[en]\tBAD THINGS").

Here is a short example text of what I'm working with:

\n2021-01-12\tThis string is not needed [it]\tBad things\tBad things\n2021-01-12\tThis string is also not needed [en]\tBad things\tBad things\n2021-01-11\tString 1 that is needed. [it]\tString 1 that is needed. is repeated here\tNot interesting here\n2021-01-11\tString 2 that is needed [fr]\tString 2 that is needed is repeated here\tUnnecessary string\n2021-01-11\tString 3 that is needed... [ru]\tString 3 that is needed... is repeated here\tAnother part we're not interested in

I made this regex to capture all strings between a date and a language code:

(\d{4}-\d{2}-\d{2}\\t)(.*?)(\[\w{2}\]\\t)

This works fine (see here). However, when I add a negative lookahead to exclude the strings followed by "Bad things", my whole regex goes south:

(\d{4}-\d{2}-\d{2}\\t)(.*?)(\[\w{2}\]\\t)(?!Bad things)

You can see the result here. I understand my lookahead somehow makes the regex greedy, but I have no idea how to avoid this; adding a ? after it doesn't work. Can you help me out here?
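The behaviour described above can be reproduced outside the regex tester. The following is a minimal Python sketch, assuming the sample is stored with literal \t and \n sequences (which is what \\t in the pattern matches); the names SAMPLE, PLAIN, WITH_LOOKAHEAD and the two result lists are made up for this illustration. It suggests the quantifier is still lazy: when the lookahead rejects a language tag, the engine backtracks and lets .*? grow until it reaches a tag that is not followed by "Bad things".

```python
import re

# Sample from the question, kept as raw strings so that \t and \n stay
# literal backslash sequences rather than real tabs and newlines.
SAMPLE = (
    r"\n2021-01-12\tThis string is not needed [it]\tBad things\tBad things"
    r"\n2021-01-12\tThis string is also not needed [en]\tBad things\tBad things"
    r"\n2021-01-11\tString 1 that is needed. [it]\tString 1 that is needed. is repeated here\tNot interesting here"
    r"\n2021-01-11\tString 2 that is needed [fr]\tString 2 that is needed is repeated here\tUnnecessary string"
    r"\n2021-01-11\tString 3 that is needed... [ru]\tString 3 that is needed... is repeated here\tAnother part we're not interested in"
)

# The two patterns from the question, unchanged.
PLAIN = re.compile(r"(\d{4}-\d{2}-\d{2}\\t)(.*?)(\[\w{2}\]\\t)")
WITH_LOOKAHEAD = re.compile(r"(\d{4}-\d{2}-\d{2}\\t)(.*?)(\[\w{2}\]\\t)(?!Bad things)")

# Group 2 is the string between the date and the language code.
plain_strings = [m.group(2) for m in PLAIN.finditer(SAMPLE)]
lookahead_strings = [m.group(2) for m in WITH_LOOKAHEAD.finditer(SAMPLE)]

print(len(plain_strings))      # one string per record
print(len(lookahead_strings))  # fewer matches, the first one overlong
print(lookahead_strings[0])    # starts at the first date, swallows two records
```

In this sketch the first match of the lookahead version starts at the first date and runs across both unwanted records up to "String 1 that is needed. ", which looks like greediness but is a side effect of backtracking.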

Not sure if this will cover all the cases, but this seems to work:

(\d{4}-\d{2}-\d{2}\\t)([^][]*)(\[\w{2}\]\\t)(?!Bad things)

Demo here.

Explanation:

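(\d{4}-\d{2}-\d{2}\\t)   date followed by a literal \t
([^][]*)                 any run of characters other than `[` and `]`
(\[\w{2}\]\\t)           two-character language tag in brackets, e.g. [en], followed by a literal \t
(?!Bad things)           negative lookahead: reject the match if "Bad things" comes next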
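As a quick check of the pattern above, here is a minimal Python sketch run against the sample from the question. It assumes the text contains literal \t and \n sequences, and the names SAMPLE, FIXED and needed are hypothetical, chosen only for this example. Because [^][]* cannot cross a bracket, the engine has no way to extend the captured string past a rejected language tag, so a record whose tag is followed by "Bad things" simply produces no match.

```python
import re

# Sample from the question, kept as raw strings so that \t and \n stay
# literal backslash sequences rather than real tabs and newlines.
SAMPLE = (
    r"\n2021-01-12\tThis string is not needed [it]\tBad things\tBad things"
    r"\n2021-01-12\tThis string is also not needed [en]\tBad things\tBad things"
    r"\n2021-01-11\tString 1 that is needed. [it]\tString 1 that is needed. is repeated here\tNot interesting here"
    r"\n2021-01-11\tString 2 that is needed [fr]\tString 2 that is needed is repeated here\tUnnecessary string"
    r"\n2021-01-11\tString 3 that is needed... [ru]\tString 3 that is needed... is repeated here\tAnother part we're not interested in"
)

# Pattern from the answer: the negated class replaces the lazy .*?
FIXED = re.compile(r"(\d{4}-\d{2}-\d{2}\\t)([^][]*)(\[\w{2}\]\\t)(?!Bad things)")

# Group 2 is the wanted string; strip() drops the space before the tag.
needed = [m.group(2).strip() for m in FIXED.finditer(SAMPLE)]

for s in needed:
    print(s)
```

Note the trade-off: this approach would miss a wanted string that itself contains a square bracket, which matches the answer's caveat that it may not cover all cases.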
