
Python Regex Negative Lookbehind

I have a large database of CT scan results and impressions. I am trying to build a regular expression that finds an integer or floating-point number followed by 'mm' that appears near the word 'nodule', either before or after it. This is the regular expression I have so far:

nodule_4mm_size = "(?s).*?([0-4]*\.*[0-9]+\s*[mM]{2})[\w\W]{0,24}[Nn]odule|(?s)[Nn]odule[\w\W]{0,24}.*?([0-4]*\.*[0-9]+\s*[mM]{2})"

However, I need to ensure that these findings are not preceded by previous or prior measurements, that is, radiologists referring to previous scans. So I am trying a negative lookbehind, like this:

(?<!previously measured)\?[Nn]odule[\w\W]{0,24}[^\.\d]([0-4]\s*[mM]{2}|[0-3]\.[0-9]\s*[mM]{2}|4\.0+\s*[mM]{2})

However, I can't get it to work. Take for instance the following paragraph.

"For example, the largest nodule which is located in the right lower lobe and currently measures 4.4 mm (image #82, series 3) previously measured 3.6 mm on 09/01/2011."

In this case, I would like the regex to hit on 4.4 mm, not 3.6 mm. Furthermore, if multiple hits are found, I would like to keep only the largest size found. For example,

"For example, the largest nodule which is located in the right lower lobe and currently measures 4.4 mm (image #82, series 3) previously measured 3.6 mm on 09/01/2011. Another nodule was found measuring 2.2 mm."

In this case I would like to ensure only 4.4 mm is identified.

Any help would truly be appreciated. I just can't get this negative lookbehind to work! Thanks!

Two possibilities:

1) using lookbehinds:

(?<!previously measured )(?<![0-9.])([0-9]+(?:\.[0-9]+)?) ?mm

The first lookbehind checks that "previously measured " does not come right before the number; the second checks that no digit or dot comes right before the number (otherwise the digit after the dot, such as the 6 in 3.6, would match on its own). Keep in mind that a regex engine returns the leftmost match first.
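A minimal sketch of how the lookbehind pattern could be applied in Python to the paragraph from the question. The names `SIZE_RE`, `report` and `current_sizes` are only illustrative.

```python
import re

# Pattern from the answer: two negative lookbehinds, then the number, then "mm".
SIZE_RE = re.compile(
    r"(?<!previously measured )"   # not directly after "previously measured "
    r"(?<![0-9.])"                 # not in the middle of a number such as "3.6"
    r"([0-9]+(?:\.[0-9]+)?) ?mm"   # group 1: the measurement itself
)

report = (
    "For example, the largest nodule which is located in the right lower lobe "
    "and currently measures 4.4 mm (image #82, series 3) previously measured "
    "3.6 mm on 09/01/2011."
)

# findall returns the contents of group 1 for every match.
current_sizes = SIZE_RE.findall(report)
print(current_sizes)  # ['4.4']
```

Note that Python's `re` module only accepts fixed-width lookbehinds, which is why the literal "previously measured " with a single trailing space works here.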

2) using capture groups:

previously measured [0-9]+(?:\.[0-9]+)? ?mm|([0-9]+(?:\.[0-9]+)?) ?mm

The idea is to match what you want to avoid first. When capture group 1 is set, you have a result.
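As a rough illustration of this trick in Python: iterate over the matches and keep only those where group 1 participated. The helper name `current_measurements` is made up for this sketch.

```python
import re

SKIP_OR_CAPTURE = re.compile(
    r"previously measured [0-9]+(?:\.[0-9]+)? ?mm"  # unwanted case: consumed, no group
    r"|([0-9]+(?:\.[0-9]+)?) ?mm"                   # wanted case: captured in group 1
)

def current_measurements(text):
    """Return the numbers captured in group 1, skipping the 'previously measured' matches."""
    return [m.group(1) for m in SKIP_OR_CAPTURE.finditer(text)
            if m.group(1) is not None]

report = (
    "For example, the largest nodule which is located in the right lower lobe "
    "and currently measures 4.4 mm (image #82, series 3) previously measured "
    "3.6 mm on 09/01/2011."
)
print(current_measurements(report))  # ['4.4']
```

Using `re.finditer` rather than `re.findall` makes it easy to tell a skipped match (group 1 is `None`) from a real one.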

As for the largest number, use the re.findall method and take the largest result afterwards (a regex alone can't do this kind of thing).
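One way this post-processing might look, reusing the lookbehind pattern from possibility 1. The function name `largest_size_mm` is an assumption of this sketch.

```python
import re

SIZE_RE = re.compile(
    r"(?<!previously measured )(?<![0-9.])([0-9]+(?:\.[0-9]+)?) ?mm"
)

def largest_size_mm(text):
    """Return the largest current measurement as a float, or None if there is none."""
    sizes = [float(s) for s in SIZE_RE.findall(text)]
    return max(sizes, default=None)

report = (
    "For example, the largest nodule which is located in the right lower lobe "
    "and currently measures 4.4 mm (image #82, series 3) previously measured "
    "3.6 mm on 09/01/2011. Another nodule was found measuring 2.2 mm."
)
print(largest_size_mm(report))  # 4.4
```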

If the word nodule needs to be nearby, you can try:

(?:((?<!previously measured\s)\d+\.\d+\s*mm)(?:[^.?!\n]*?)?nodule|nodule(?:[^.?!\n]*?((?<!previously measured\s)\d+\.\d+\s*mm))?)


It will match if:

  • the word nodule is in the same sentence as the value in mm (the [^.?!\n] class is meant to keep the match inside one sentence; however, abbreviations like "Mr.", decimals, etc. will disturb the match). You can replace it with .+? but then it can match across sentences,
  • the value is before or after the word nodule (in this order: if there is a value before, it will be matched first),
  • values are captured in groups: a value before nodule in \1, a value after it in \2,
  • it should be used with the g (global) and i (case-insensitive) modes; in Python this roughly corresponds to re.finditer or re.findall with re.IGNORECASE.
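A sketch of how this pattern might be used from Python, written with the decimal point escaped as `\.`. The helper name `nodule_sizes` is hypothetical; it simply returns whichever of the two groups was set.

```python
import re

NEAR_NODULE = re.compile(
    r"(?:((?<!previously measured\s)\d+\.\d+\s*mm)(?:[^.?!\n]*?)?nodule"
    r"|nodule(?:[^.?!\n]*?((?<!previously measured\s)\d+\.\d+\s*mm))?)",
    re.IGNORECASE,  # the "i" mode; finditer plays the role of the "g" mode
)

def nodule_sizes(text):
    """Collect the value captured before (group 1) or after (group 2) 'nodule'."""
    found = []
    for m in NEAR_NODULE.finditer(text):
        value = m.group(1) or m.group(2)
        if value is not None:  # 'nodule' alone matches too, with both groups unset
            found.append(value)
    return found

report = (
    "For example, the largest nodule which is located in the right lower lobe "
    "and currently measures 4.4 mm (image #82, series 3) previously measured "
    "3.6 mm on 09/01/2011. Another nodule was found measuring 2.2 mm."
)
print(nodule_sizes(report))  # ['4.4 mm', '2.2 mm']
```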

Another similar solution would be:

(?=((?<!previously measured\s)\d+\.\d+ mm)[^.?!]+nodule)|(?=nodule[^.?!]+((?<!previously measured\s)\d+\.\d+ mm))


It is based only on lookarounds, so it does not match text directly but a zero-length position, and it captures the values into groups.

Let's break it down, keeping the relevant parts. So far you have 2 options:

Option 1 (number followed by "nodule"):

([0-4]\.\d+\s*[mM]{2})[\s\S]{0,24}[Nn]odule

Option 2 ("nodule" followed by number):

[Nn]odule[\s\S]{0,24}([0-4]\.\d+\s*[mM]{2})

You should know that quantifiers are greedy by default. This means that [\s\S]{0,24} will try to match as much as it can, so the captured number is not necessarily the one closest to "nodule". For example,

Pattern: [Nn]odule[\s\S]{0,24}([0-4]\.\d+\s*[mM]{2})

Text: ... nodule measured 1.4 mm. Another 3.2 mm ...
                                          ^    ^
                                          |    |
          matches this second occurrence. +----+

To fix this, add an extra ? after the quantifier to make it lazy. So, instead of using [\s\S]{0,24}, use [\s\S]{0,24}?.
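The difference can be checked with a short Python sketch. It assumes a window of 40 characters rather than 24, so that the second measurement in the sample text is within reach of the greedy quantifier; the variable names are illustrative.

```python
import re

text = "... nodule measured 1.4 mm. Another 3.2 mm ..."

# Greedy window: grabs as much as possible, then backtracks to the LAST number in reach.
greedy = re.search(r"[Nn]odule[\s\S]{0,40}([0-4]\.\d+\s*[mM]{2})", text).group(1)

# Lazy window: expands one character at a time and stops at the FIRST number.
lazy = re.search(r"[Nn]odule[\s\S]{0,40}?([0-4]\.\d+\s*[mM]{2})", text).group(1)

print(greedy)  # 3.2 mm
print(lazy)    # 1.4 mm
```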


For example, the largest nodule which is located in the right lower lobe and currently measures 4.4 mm

In this example, "nodule" and the measurement are separated by more than 24 characters. You should increase the number of characters allowed in between, maybe to [\s\S]{0,70}?.


So I am trying a negative lookbehind

Lookbehinds only assert on the text immediately before a given position. To get around this, I recommend matching the text "previously measured" itself, consuming some characters around it. So how do you know to ignore those cases? Easy: don't create a capture group for them. You will be matching something like

[\s\S]{0,10}previously measured[\s\S]{0,10}

and discarding the match because it didn't return any groups. Moreover, you could include different exceptions here:

[\s\S]{0,10}(?:previously measured|previous scan|another patient|incorrectly measured)[\s\S]{0,10}

if multiple hits are found I would like to only keep the largest size found

You can't do that with regex alone. Loop over the matches in your code to find the largest.


Result:

With these conditions, we have:

[\s\S]{0,10}previously measured[\s\S]{0,10}|([0-4]\.\d+\s*[mM]{2})[\s\S]{0,70}?[Nn]odule|[Nn]odule[\s\S]{0,70}?([0-4]\.\d+\s*[mM]{2})
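A minimal sketch of how the combined pattern might be driven from Python, including the loop that keeps the largest value. The helper name `largest_nodule_mm` and the way the number is parsed out of the captured text are assumptions of this sketch.

```python
import re

COMBINED = re.compile(
    r"[\s\S]{0,10}previously measured[\s\S]{0,10}"       # exception: matched, no group
    r"|([0-4]\.\d+\s*[mM]{2})[\s\S]{0,70}?[Nn]odule"     # group 1: size before "nodule"
    r"|[Nn]odule[\s\S]{0,70}?([0-4]\.\d+\s*[mM]{2})"     # group 2: size after "nodule"
)

def largest_nodule_mm(text):
    """Return the largest captured size in mm, or None if nothing was captured."""
    sizes = []
    for m in COMBINED.finditer(text):
        captured = m.group(1) or m.group(2)
        if captured is None:
            continue  # an exception such as "previously measured ..." was consumed
        sizes.append(float(captured.lower().replace("mm", "").strip()))
    return max(sizes, default=None)

report = (
    "For example, the largest nodule which is located in the right lower lobe "
    "and currently measures 4.4 mm (image #82, series 3) previously measured "
    "3.6 mm on 09/01/2011. Another nodule was found measuring 2.2 mm."
)
print(largest_nodule_mm(report))  # 4.4
```

Because the first alternative consumes up to 10 characters on either side of "previously measured", the 3.6 mm value is swallowed by a match that sets no group and is therefore never reported.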



Extra conditions to check

One of the following options may be useful to reduce false positives:

  1. Don't allow the match to continue past a newline.
  2. Don't match if there's a full stop between "nodule" and the number.
  3. Look for a date near the measurement.

To solve this problem, I ended up tokenizing the reports into individual sentences using the nltk module. The final regular expression, which works for all instances, is:

nodule_search = r"[\s\S]{0,10}(?:previously measured|compared to )[\s\S]{0,10}|(\d[\.,]\d+|\d+|\d\d[\.,]\d+)\s*[mM]{2}[\s\S]{0,40}?[Nn]odule|[Nn]odule[\s\S]{0,40}?(\d[\.,]\d+|\d+|\d\d[\.,]\d+)\s*[mM]{2}"

So in this instance I ended up not using a negative lookbehind, but used capture groups instead.
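A rough sketch of that workflow, under two loud assumptions: the sentence splitter below is a naive regex stand-in for the nltk tokenizer (so the example runs without nltk and its data files), and the sample report is invented for illustration. The function names are hypothetical.

```python
import re

NODULE_SEARCH = re.compile(
    r"[\s\S]{0,10}(?:previously measured|compared to )[\s\S]{0,10}"
    r"|(\d[\.,]\d+|\d+|\d\d[\.,]\d+)\s*[mM]{2}[\s\S]{0,40}?[Nn]odule"
    r"|[Nn]odule[\s\S]{0,40}?(\d[\.,]\d+|\d+|\d\d[\.,]\d+)\s*[mM]{2}"
)

def split_sentences(report):
    # Stand-in for nltk sentence tokenization: split after ., ! or ? followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", report.strip())

def largest_nodule_mm(report):
    """Scan each sentence separately and return the largest captured size, or None."""
    sizes = []
    for sentence in split_sentences(report):
        for m in NODULE_SEARCH.finditer(sentence):
            value = m.group(1) or m.group(2)
            if value is not None:  # matches without a group are prior measurements
                sizes.append(float(value.replace(",", ".")))
    return max(sizes, default=None)

# Invented sample text, not taken from the original reports.
sample = (
    "A nodule in the right lower lobe measures 4.4 mm. "
    "It previously measured 3.6 mm. "
    "Another 2.2 mm nodule is seen in the left lung."
)
print(largest_nodule_mm(sample))  # 4.4
```

Working sentence by sentence keeps the 40-character window from reaching into a neighbouring sentence, which is presumably why tokenizing first helped.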

Thanks everyone for your input.
