
Regex Character Class Subtraction with PHP

Hi,

I'm trying to match UK postcodes, using the pattern from http://interim.cabinetoffice.gov.uk/media/291370/bs7666-v2-0-xsd-PostCodeType.htm:

/^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z-[CIKMOV]]{2}$/

I'm using this in PHP, but it doesn't match the valid postcode OL13 0EF. The postcode does match, however, when I remove the -[CIKMOV] character class subtraction.

I get the impression that I'm doing character class subtraction wrong in PHP. I'd be most grateful if anyone could correct my error.

Thanks in advance for your help.

Ross

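Most regex flavours do not support character class subtraction. Instead, you could use a negative look-ahead assertion, which here rejects the position if either of the next two characters is one of C, I, K, M, O, V: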

/^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9](?!.?[CIKMOV])[A-Z]{2}$/
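
For anyone who wants to sanity-check the look-ahead version, here is a minimal sketch in Python. Python's re module is a different engine from PHP's PCRE, so treat this as an illustration of the pattern rather than a PHP test. The helper name is_postcode and the two rejected sample strings are made up for the example.

```python
import re

# Look-ahead variant from the answer above, as a Python raw string.
# (?!.?[CIKMOV]) fails if either of the next two characters is one of
# C, I, K, M, O, V: ".?" is tried with one character first, then empty.
POSTCODE = re.compile(r"^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9](?!.?[CIKMOV])[A-Z]{2}$")


def is_postcode(text: str) -> bool:
    """Return True if text matches the look-ahead-based pattern (hypothetical helper)."""
    return POSTCODE.match(text) is not None


print(is_postcode("OL13 0EF"))  # the postcode from the question
print(is_postcode("OL13 0CF"))  # excluded letter in the first position
print(is_postcode("OL13 0EC"))  # excluded letter in the second position
```

The first call should report a match and the other two should not, since each contains a C in one of the final two letters.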

If class subtraction is not supported, you should be able to use negated classes to achieve a subtraction.

Some examples are [^\D] = \d and [^[:^alpha:]] = [a-zA-Z]

Your problem could be solved that way, using a negated POSIX character class inside a negated character class, like [^a-z[:^alpha:]CIKMOV]

[^
a-z # not a-z
[:^alpha:] # not not A-Za-z
CIKMOV # not C,I,K,M,O,V
]

Edit: this works too and might be easier to read: [^[:^alpha:][:lower:]CIKMOV]

[^
[:^alpha:] # not not A-Za-z
[:lower:] # not a-z
CIKMOV # not C,I,K,M,O,V
]

The result is a character class that is A-Z without C, I, K, M, O, V: basically a subtraction.

Here is a test of 2 different class concoctions (in Perl):

use strict;
use warnings;

my $match = '';

   # ANYOF[^\0-@CIKMOV[-\377!utf8::IsAlpha]
for (0 .. 255) {
   if (chr($_) =~ /^[^a-z[:^alpha:]CIKMOV]$/) {
       $match .= chr($_); next;
   }
   $match .= ' ';
}
$match =~ s/^ +//;
$match =~ s/ +$//;
print "'$match'\n";
$match = '';

   # ANYOF[^\0-@CIKMOV[-\377+utf8::IsDigit !utf8::IsWord]
for (0 .. 255) {
   if (chr($_) =~ /^[^a-z\d\W_CIKMOV]$/) {
       $match .= chr($_); next;
   }
   $match .= ' ';
}
$match =~ s/^ +//;
$match =~ s/ +$//;
print "'$match'\n";

The output shows the gaps in A-Z minus CIKMOV, across the tested character codes 0-255:
'AB DEFGH JLN PQRSTU WXYZ'
'AB DEFGH JLN PQRSTU WXYZ'
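
The second class in the Perl test uses only shorthand escapes (\d, \W), so it should carry over to engines that lack POSIX bracket expressions. A rough sketch in Python, assuming the re.ASCII flag so that \d and \W are limited to ASCII and character codes 128-255 are treated as non-word characters; the variable names are made up for the example:

```python
import re

# Negated class: excludes lowercase letters, digits, non-word characters,
# the underscore, and the letters C, I, K, M, O, V.
# re.ASCII keeps \d and \W ASCII-only, so accented letters in 128-255
# count as non-word characters and are excluded as well.
SUBTRACTED = re.compile(r"[^a-z\d\W_CIKMOV]", re.ASCII)

# Collect every character code 0-255 that the class accepts.
matched = "".join(c for c in map(chr, range(256)) if SUBTRACTED.fullmatch(c))
print(matched)  # ABDEFGHJLNPQRSTUWXYZ
```

Unlike the Perl test, this prints the accepted letters without the spacing that marks the gaps.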

PCRE does not support char class subtraction.

So you can enumerate all the uppercase letters except CIKMOV :

^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABDEFGHJLNPQRSTUWXYZ]{2}$

which can be shortened using ranges to:

^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}$

I think you're going to have to replace [A-Z-[CIKMOV]] with [ABD-HJLNP-UW-Z]. I don't think PHP supports character class subtraction. My alternative reads something like "A, B, D to H, J, L, N, P to U, and W to Z".
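
Hand-written ranges are easy to get wrong by one letter (D to H must stop before I, for instance), so it may be worth checking the ranged class against a computed set difference. A small sketch in Python, used here only as a convenient checker and not as a PHP test; the names EXCLUDED, allowed and accepted are made up for the example:

```python
import re
import string

EXCLUDED = "CIKMOV"

# Uppercase letters left over after removing the excluded ones.
allowed = "".join(c for c in string.ascii_uppercase if c not in EXCLUDED)
print(allowed)  # ABDEFGHJLNPQRSTUWXYZ

# The ranged class suggested above.
ranged = re.compile(r"[ABD-HJLNP-UW-Z]")

# Every character code 0-255 that the ranged class accepts.
accepted = "".join(c for c in map(chr, range(256)) if ranged.fullmatch(c))
print(accepted == allowed)  # True
```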


 