php regex - find uppercase string with number and spaces in text

I want to write a PHP regular expression that finds an uppercase string in a text. The string may also contain one digit and spaces.

For example, from the text "some text to contain EXAM PL E 7STRING uppercase word" I want to get the string EXAM PL E 7STRING.

The found string should start and end with an uppercase letter. In between, besides uppercase letters, it may (but need not) contain one digit and spaces. So the regex should match any of these patterns:

1) EXAMPLESTRING               - just uppercase string
2) EXAMP4LESTRING              - with number
3) EXAMPLES TRING              - with space
4) EXAM PL E STRING            - with more than one spaces
5) EXAMP LE4STRING             - with number and space
6) EXAMP LE 4ST RI NG          - with number and spaces 

The string should also contain at least 4 letters in total.

I wrote this regex '/[A-Z]{1,}([A-Z\s]{2,}|\d?)[A-Z]{1,}/', which finds the first 4 patterns, but I cannot figure out how to make it match the last 2 patterns as well.

Thanks

There is a neat trick called a lookahead. It just checks what follows the current position, without consuming it. That can be used to check for multiple conditions:

'/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])(?!(?:[A-Z\s]*\d){2})[A-Z][A-Z\s\d]*[A-Z]/'

The first lookaround is actually a lookbehind and checks that there is no preceding uppercase letter. This is just a little speedup for strings that would fail the match anyway. The second lookaround (a lookahead) checks that there are at least four letters. The third one checks that there are not two digits. The rest then simply matches a string of the allowed characters, starting and ending with an uppercase letter.
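As a minimal sketch, the pattern above could be applied with `preg_match` to the sample sentence from the question. The helper name `findUppercaseRun` is made up for illustration and is not part of the original answer.

```php
<?php
// Pattern from the answer above: lookbehind, two lookaheads, then the actual match.
$pattern = '/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])(?!(?:[A-Z\s]*\d){2})[A-Z][A-Z\s\d]*[A-Z]/';

// Hypothetical helper: returns the first match, or null if nothing matches.
function findUppercaseRun(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

$text = 'some text to contain EXAM PL E 7STRING uppercase word';
$found = findUppercaseRun($pattern, $text);

echo $found ?? '(no match)', PHP_EOL; // prints: EXAM PL E 7STRING
```

The pattern is written in a single-quoted PHP string so that `\s` and `\d` reach the regex engine unchanged.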

Note that if the string contains two digits, this will not match at all at that position (instead of matching everything up to the second digit). If you do want a match in such a case, you could incorporate the "one digit" rule into the actual match instead:

'/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])[A-Z][A-Z\s]*\d?[A-Z\s]*[A-Z]/'
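The difference between the two variants can be seen on an input with two digits. The following is only a sketch: the input `ABCD1EF2G` and the variable names are made up for illustration.

```php
<?php
// Variant 1: the "at most one digit" rule sits in a negative lookahead.
$strict = '/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])(?!(?:[A-Z\s]*\d){2})[A-Z][A-Z\s\d]*[A-Z]/';
// Variant 2: the rule is part of the match itself (a single optional \d).
$lenient = '/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])[A-Z][A-Z\s]*\d?[A-Z\s]*[A-Z]/';

$input = 'ABCD1EF2G'; // made-up input containing two digits

$strictResult  = preg_match($strict, $input, $m) === 1 ? $m[0] : null;
$lenientResult = preg_match($lenient, $input, $m) === 1 ? $m[0] : null;

var_export($strictResult);  // NULL: the second digit makes the lookahead reject the whole run
echo PHP_EOL;
var_export($lenientResult); // 'ABCD1EF': matches up to, but not including, the second digit
echo PHP_EOL;
```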

EDIT:

As Ωmega pointed out, this will cause problems if there are fewer than four letters before the second digit, but more after it. This is actually quite tough, because the assertion needs to be that there are at least four letters before the second digit. Since we do not know where the first digit occurs among those four letters, we have to check all possible positions. For this I would do away with the lookaheads altogether and simply provide the three different alternatives. (I will keep the lookbehind as an optimization for non-matching parts.)

'/(?<![A-Z])[A-Z]\s*(?:\d\s*[A-Z]\s*[A-Z]|[A-Z]\s*\d\s*[A-Z]|[A-Z]\s*[A-Z][A-Z\s]*\d?)[A-Z\s]*[A-Z]/'

Or here with added comments:

'/
(?<!         # negative lookbehind
    [A-Z]    # current position is not preceded by a letter
)            # end of lookbehind
[A-Z]        # match has to start with uppercase letter
\s*          # optional spaces after first letter
(?:          # subpattern for possible digit positions
    \d\s*[A-Z]\s*[A-Z]
             # digit comes after first letter, we need two more letters before last one
|            # OR
    [A-Z]\s*\d\s*[A-Z]
             # digit comes after second letter, we need one more letter before last one
|            # OR
    [A-Z]\s*[A-Z][A-Z\s]*\d?
             # digit comes after third letter, or later, or not at all
)            # end of subpattern for possible digit positions
[A-Z\s]*     # arbitrary amount of further letters and whitespace
[A-Z]        # match has to end with uppercase letter
/x'

That gives the same result on Ωmega's lengthy test input.
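To illustrate the edge case described in the edit, here is a small sketch comparing the lookahead variant with the final alternation-based pattern. The input `AB1C2DEFG` is a made-up example of "fewer than four letters before the second digit, more after it", and the variable names are hypothetical.

```php
<?php
// Lookahead variant from before the edit (digit rule inside the match).
$lookaheadVariant = '/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])[A-Z][A-Z\s]*\d?[A-Z\s]*[A-Z]/';
// Final variant: one alternative per possible digit position.
$alternation = '/(?<![A-Z])[A-Z]\s*(?:\d\s*[A-Z]\s*[A-Z]|[A-Z]\s*\d\s*[A-Z]|[A-Z]\s*[A-Z][A-Z\s]*\d?)[A-Z\s]*[A-Z]/';

// Made-up edge case: three letters before the second digit, four after it.
$edge = 'AB1C2DEFG';
preg_match($lookaheadVariant, $edge, $a);
preg_match($alternation, $edge, $b);
echo $a[0], PHP_EOL; // AB1C   (only three letters, so too short)
echo $b[0], PHP_EOL; // C2DEFG (five letters, one digit)

// The six example strings from the question should each match in full.
$examples = [
    'EXAMPLESTRING',
    'EXAMP4LESTRING',
    'EXAMPLES TRING',
    'EXAM PL E STRING',
    'EXAMP LE4STRING',
    'EXAMP LE 4ST RI NG',
];
$fullMatches = [];
foreach ($examples as $s) {
    $fullMatches[$s] = preg_match($alternation, $s, $m) === 1 && $m[0] === $s;
}
var_export($fullMatches); // every value is true
echo PHP_EOL;
```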

I suggest using the regex pattern

[A-Z][ ]*(\d)?(?(1)(?:[ ]*[A-Z]){3,}|[A-Z][ ]*(\d)?(?(2)(?:[ ]*[A-Z]){2,}|[A-Z][ ]*(\d)?(?(3)(?:[ ]*[A-Z]){2,}|[A-Z][ ]*(?:\d|(?:[ ]*[A-Z])+[ ]*\d?))))(?:[ ]*[A-Z])*

(see this demo).

[A-Z][ ]*(?:\d(?:[ ]*[A-Z]){2}|[A-Z][ ]*\d[ ]*[A-Z]|(?:[A-Z][ ]*){2,}\d?)[A-Z ]*[A-Z]

(see this demo)
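As a rough sketch, the shorter pattern could be used from PHP as follows. This assumes the character classes in the pattern read `[A-Z]` and the digit token reads `\d`. The `/` delimiters and the variable names are added here for illustration and are not part of the answer.

```php
<?php
// Shorter pattern from this answer, wrapped in '/' delimiters (an assumption).
// It allows literal spaces only ([ ]), not arbitrary whitespace (\s).
$short = '/[A-Z][ ]*(?:\d(?:[ ]*[A-Z]){2}|[A-Z][ ]*\d[ ]*[A-Z]|(?:[A-Z][ ]*){2,}\d?)[A-Z ]*[A-Z]/';

$text = 'some text to contain EXAM PL E 7STRING uppercase word';

// preg_match_all collects every match in the text, not only the first one.
preg_match_all($short, $text, $matches);

echo implode(PHP_EOL, $matches[0]), PHP_EOL; // prints: EXAM PL E 7STRING
```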
