Regular expression to extract case insensitive substring in Java

I am trying to extract the grant number from a paragraph. Grant numbers are usually alphanumeric, with capital letters, and can contain hyphens (-), but they never contain spaces.

The following are some examples of grant numbers:

  • W9124A-18-0001
  • 007-FY2018
  • W81XWH18PRMRPTTDA
  • 07-544

I am not even certain whether a given paragraph contains a grant number, so currently I rely on the word grant appearing just before it.


Example:

This research was supported by NIH/NHLBI Grant W9124A-18-0001(PI, Michael Brown)

I tried the following regex:

(?i)grant [A-Z0-9-]*

but it's not perfect: it matches Grant w9124A-18-0001 when it shouldn't (lowercase w). How can I improve it?

You can use the expression:

(?i)(?<=Grant\s)(?-i)[A-Z0-9-]+\b
  • (?i) Turn on case-insensitive matching.
  • (?<=Grant\s) Positive lookbehind for Grant followed by a whitespace character.
  • (?-i) Turn case insensitivity back off.
  • [A-Z0-9-]+ Match one or more digits, uppercase letters, or hyphens (-).
  • \b Word boundary.

In a Java string literal, write each backslash twice (\\s and \\b).

You can try it live here.
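As a rough illustration of how the expression above might be used from Java, here is a minimal sketch. The class name `GrantLookbehindDemo` and the helper `findGrantNumber` are made up for this example; only `java.util.regex` is assumed.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the original answer.
final class GrantLookbehindDemo {

    // Same expression as above; backslashes are doubled for the Java string literal.
    private static final Pattern GRANT_NUMBER =
            Pattern.compile("(?i)(?<=Grant\\s)(?-i)[A-Z0-9-]+\\b");

    // Returns the first grant number that directly follows the word "grant", if any.
    static Optional<String> findGrantNumber(String text) {
        Matcher m = GRANT_NUMBER.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String text =
                "This research was supported by NIH/NHLBI Grant W9124A-18-0001(PI, Michael Brown)";
        // Prints: W9124A-18-0001
        System.out.println(findGrantNumber(text).orElse("no grant number found"));
    }
}
```

Because the keyword sits inside a lookbehind, `m.group()` holds only the number, so no capture group is needed. An identifier that starts with a lowercase letter, such as w9124A-18-0001, yields an empty result.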

Turning case sensitivity back on with (?c), instead of turning insensitivity off with (?-i), as in:

(?i)(?<=Grant\s)(?c)[A-Z0-9-]+\b

is only supported by Tcl.

You need to turn off case insensitivity after Grant:

(?i)grant (?-i)[A-Z0-9-]*
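A possible way to apply this from Java is sketched below. It makes two changes that are assumptions of this sketch, not part of the answer: a capture group is added so the number can be read without the keyword, and `*` is replaced by `+`, because with `*` the pattern still matches "Grant " followed by an empty number when the identifier starts with a lowercase letter. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the original answer.
final class GrantCaptureDemo {

    // Keyword is case-insensitive; the captured number is case-sensitive.
    // Assumption of this sketch: capture group added and '*' changed to '+'.
    private static final Pattern GRANT =
            Pattern.compile("(?i)grant (?-i)([A-Z0-9-]+)");

    // Collects every grant number that follows the word "grant" and a single space.
    static List<String> findAllGrantNumbers(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = GRANT.matcher(text);
        while (m.find()) {
            numbers.add(m.group(1)); // group 1 is the number without the keyword
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Prints: [007-FY2018, W81XWH18PRMRPTTDA]
        System.out.println(
                findAllGrantNumbers("Funded by grant 007-FY2018 and GRANT W81XWH18PRMRPTTDA."));
    }
}
```

Unlike the lookbehind variant, this pattern has no trailing word boundary, so for an input such as "Grant W9abc" it would capture the leading uppercase part "W9".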

Fundamentally speaking, you're not accounting for case. Your regex as it stands only looks for "grant" and would fail on "Grant". Worse, your grant identifiers also have mixed case, and your regex isn't checking that, either.

The simplest way to solve this would be to ensure that your regex actually supports those values. You don't need anything too fancy here; just perform a simple match.

[Gg]rant [A-Za-z0-9\-]+

Fancier matching, such as specific subgroup matching on the hyphen-delimited parts of the grant ID, is left as an exercise for the reader.
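To see how the character-class pattern above behaves on inputs like those in the question, here is a small sketch. The class name `GrantCharClassDemo` and the helper `firstMatch` are hypothetical. Note that this pattern accepts lowercase letters in the identifier and only accepts the keyword spelled grant or Grant.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the original answer.
final class GrantCharClassDemo {

    // Same pattern as above; the backslash is doubled for the Java string literal.
    private static final Pattern GRANT =
            Pattern.compile("[Gg]rant [A-Za-z0-9\\-]+");

    // Returns the first full match (keyword plus identifier), or null if there is none.
    static String firstMatch(String text) {
        Matcher m = GRANT.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Prints: Grant w9124A-18-0001 (lowercase letters are allowed by this pattern)
        System.out.println(firstMatch("Supported by Grant w9124A-18-0001"));
        // Prints: null (the all-caps keyword is not covered by [Gg]rant)
        System.out.println(firstMatch("Supported by GRANT 07-544"));
    }
}
```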
