
Extract one or more consecutive words with first capital letter

I wrote a regular expression to extract one or more consecutive words that start with a capital letter. It needs to handle accented letters, but those letters break the expression and produce false matches.

This is the example: http://www.phpliveregex.com/p/eHE (select preg_match_all)

My regular expression:

/([ÁÉÍÓÚÑA-Z]+[a-záéíóúñ]*[\s]{0,1}){1,}/

Test string:

Esto es una prueba para extraer diferentes nombres de personas como Fernández Díaz, Logroño, la Comunidad Valenciana, o también siglas como AVE, y cualquier cosa que empiece por mayúscula y tenga una o varias palabras.

In this case, the fragments "úscula" and "én" should not appear among the matches.

preg_match_all('/(\b\p{Lu}\p{L}+\s*)+/u', $input, $output);

That assumes a "word" consists of letters only, and that only words separated by whitespace characters count as consecutive.

Demo: http://www.phpliveregex.com/p/eHG
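
To see what that pattern returns, here is a minimal sketch run against the test sentence from the question. It assumes the source file is saved as UTF-8; the $names variable and the trim step are additions for illustration, not part of the answer. Because each word is followed by an optional \s*, a match can carry a trailing space, which is why the results are trimmed.

```php
<?php
// Test sentence from the question (file assumed to be UTF-8 encoded).
$input = "Esto es una prueba para extraer diferentes nombres de personas "
       . "como Fernández Díaz, Logroño, la Comunidad Valenciana, o también "
       . "siglas como AVE, y cualquier cosa que empiece por mayúscula "
       . "y tenga una o varias palabras.";

preg_match_all('/(\b\p{Lu}\p{L}+\s*)+/u', $input, $output);

// $output[0] holds the full matches. The trailing \s* can leave a space
// at the end of a match (for example "Esto "), so trim each one.
$names = array_map('trim', $output[0]);

print_r($names);
// Esto, Fernández Díaz, Logroño, Comunidad Valenciana, AVE
```

Note that \p{L}+ requires at least one letter after the capital, so a one-letter word such as "A" would not match with this pattern.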

As indicated in the comments, the way to match letters, including all accented ones, is to use the \p escape sequence in combination with the u (Unicode) modifier:

additional escape sequences to match generic character types are available when UTF-8 mode is selected.

\p{xx}
a character with the xx property

L    Letter (includes the following properties: Ll, Lm, Lo, Lt and Lu)
Ll   Lower case letter
Lm   Modifier letter
Lo   Other letter
Lt   Title case letter
Lu   Upper case letter
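
The following sketch illustrates why the u modifier matters here. It uses a shortened version of the question's character classes (an assumption made to keep the example small) and "\u{...}" string escapes, which need PHP 7 or later. Without u, the pattern and subject are compared byte by byte; "Á" and "ú" are both two-byte UTF-8 sequences that share the lead byte 0xC3, so the "uppercase" class ends up matching the first byte of "ú".

```php
<?php
// "\u{...}" escapes expand to UTF-8 bytes and keep this file ASCII-only.
$word = "may\u{FA}scula";   // "mayúscula"

// Byte mode (no u modifier): the class [ÁA-Z] really contains the bytes
// 0xC3 and 0x81, so it matches the first byte of "ú" (0xC3 0xBA).
$byteCount = preg_match_all("/[\u{C1}A-Z]+[a-z\u{FA}]*/", $word, $byteMode);

// UTF-8 mode: the classes contain whole characters, so nothing matches.
$utf8Count = preg_match_all("/[\u{C1}A-Z]+[a-z\u{FA}]*/u", $word, $utf8Mode);

// Unicode properties cover every cased letter without listing them.
$isUpper = preg_match('/^\p{Lu}$/u', "\u{D1}");   // "Ñ"
$isLower = preg_match('/^\p{Ll}$/u', "\u{F1}");   // "ñ"

var_dump($byteMode[0]);   // one match: "úscula"
var_dump($utf8Count);     // int(0)
var_dump($isUpper, $isLower);   // int(1), int(1)
```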

You could thus use this regex:

\b(?![\h,])(?:[\h,]*\p{Lu}\pL*)+

This expression checks that the match does not start with horizontal whitespace (\h) or a comma, but then matches capitalised words separated by those characters. You can remove the comma from both character classes if that is not what you want, or add other punctuation to them.

Note that braces are required when more than one letter follows \p: \pL is shorthand for \p{L}, but Lu must be written as \p{Lu}.
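
A small sketch of the brace rule, using made-up test strings: without braces only the first letter after \p is taken as the property name, so \pLu is read as "any letter, then a literal u".

```php
<?php
// \pL and \p{L} are equivalent: a one-letter property needs no braces.
$a = preg_match('/^\pL+$/u', 'Díaz');     // 1
$b = preg_match('/^\p{L}+$/u', 'Díaz');   // 1

// \pLu is NOT \p{Lu}: it means \p{L} followed by a literal "u".
$c = preg_match('/^\pLu$/u', 'xu');       // 1: any letter, then "u"
$d = preg_match('/^\pLu$/u', 'X');        // 0: no trailing "u"
$e = preg_match('/^\p{Lu}$/u', 'X');      // 1: one upper case letter

var_dump($a, $b, $c, $d, $e);
```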

See PHP Live Regex

Example code (see it on eval.in):

$text = "Esto es una prueba para extraer diferentes nombres de personas " .
        "como Fernández Díaz, Logroño, la Comunidad Valenciana, o también " .
        "siglas como AVE, y cualquier cosa que empiece por mayúscula " .
        "y tenga una o varias palabras.";

preg_match_all('/\b(?![\h,])(?:[\h,]*\p{Lu}\pL*)+/u', $text, $matches); 

var_export($matches);

Output:

array (
  0 => 
  array (
    0 => 'Esto',
    1 => 'Fernández Díaz, Logroño',
    2 => 'Comunidad Valenciana',
    3 => 'AVE',
  ),
)

Without the commas in the two character classes, 'Fernández Díaz, Logroño' is split into two separate matches:

array (
  0 => 
  array (
    0 => 'Esto',
    1 => 'Fernández Díaz',
    2 => 'Logroño',
    3 => 'Comunidad Valenciana',
    4 => 'AVE',
  ),
)
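
Both variants can be wrapped in one function. This is only a sketch: extractCapitalizedRuns and its $joinAcrossCommas flag are hypothetical names introduced here, not something from the answer, and the file is assumed to be UTF-8 encoded.

```php
<?php
/**
 * Hypothetical helper: returns runs of consecutive capitalised words.
 * When $joinAcrossCommas is true, a comma may separate words in a run.
 */
function extractCapitalizedRuns(string $text, bool $joinAcrossCommas = true): array
{
    // Separators allowed between the words of one run.
    $sep = $joinAcrossCommas ? '[\h,]' : '\h';
    $pattern = '/\b(?!' . $sep . ')(?:' . $sep . '*\p{Lu}\pL*)+/u';

    // preg_match_all() returns false on error, e.g. invalid UTF-8 input.
    if (preg_match_all($pattern, $text, $matches) === false) {
        return [];
    }
    return $matches[0];
}

$text = "Esto es una prueba para extraer diferentes nombres de personas "
      . "como Fernández Díaz, Logroño, la Comunidad Valenciana, o también "
      . "siglas como AVE, y cualquier cosa que empiece por mayúscula "
      . "y tenga una o varias palabras.";

$joined = extractCapitalizedRuns($text);          // 4 matches
$split  = extractCapitalizedRuns($text, false);   // 5 matches

var_export($joined);
var_export($split);
```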
