I need to filter out words or word groups from a String via Regex

Good evening. I have a string like "leicht bewölkt leichter Regen Regen". I need a regex pattern that matches "leicht bewölkt" (two adjectives), "leichter Regen" (adjective and noun) and "Regen" (noun). I have found out how to match a single adjective with "\b[a-z][a-z]*\b", but how can I do that with two adjectives, or with one adjective and a noun? I'm a bit lost. Thanks in advance.

\b[a-z][a-z]*\b

A regex matching a single full word that starts with an uppercase letter is easy to derive from your current regex: just replace the first character class with its uppercase equivalent:

\b[A-Z][a-z]*\b

Now we only need to combine the two to match the following patterns:

  • two words, both starting with lowercase letters (two adjectives)
  • two words, the first starting with a lowercase letter, the second with an uppercase (adjective and noun)
  • a single word starting with an uppercase letter (noun)

We can represent consecutive words by joining them with a single space character.

A basic solution is an alternation of the three patterns listed above:

\b[a-z][a-z]*\b \b[a-z][a-z]*\b|\b[a-z][a-z]*\b \b[A-Z][a-z]*\b|\b[A-Z][a-z]*\b

^________two adjectives_______^ ^____one adjective one noun___^ ^__one  noun__^ 
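To see the basic alternation at work, here is a minimal Java sketch. The class name `BasicAlternationDemo`, the helper `findPhrases` and the ASCII-only sample input (with "sonnig" standing in for "bewölkt") are assumptions made for illustration; they are not part of the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BasicAlternationDemo {
    // The three alternatives from the pattern above, joined with '|'.
    static final Pattern PHRASE = Pattern.compile(
        "\\b[a-z][a-z]*\\b \\b[a-z][a-z]*\\b"        // two adjectives
        + "|\\b[a-z][a-z]*\\b \\b[A-Z][a-z]*\\b"     // one adjective, one noun
        + "|\\b[A-Z][a-z]*\\b");                     // one noun

    // Hypothetical helper: collects every non-overlapping match, left to right.
    static List<String> findPhrases(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PHRASE.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // ASCII-only sample input (assumption), so the plain [a-z] classes suffice.
        System.out.println(findPhrases("leicht sonnig leichter Regen Regen"));
        // prints [leicht sonnig, leichter Regen, Regen]
    }
}
```

The order of the alternatives matters less than it may seem here: each alternative is anchored by a word boundary and requires a different capitalisation for the second word, so at any given start position at most one of them can succeed.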

It can be improved in multiple ways:

  • your regex for a single lowercase word can be written as \b[a-z]+\b (+ means "one or more", which is the same as one occurrence followed by "zero or more", *)
  • there is always a word boundary between a character of [a-z] and a space, so the \b after a word and before a space, and the \b after a space and before a word, can be removed: they always match whenever the word and the space do.
  • you could factor out the common part of the first two patterns, as they both start with a lowercase word, or of the last two, as they both end in a noun. However, I think this would reduce readability and therefore maintainability, so I will abstain.

In conclusion, I would use the following:

\b[a-z]+ [a-z]+\b|\b[a-z]+ [A-Z][a-z]*\b|\b[A-Z][a-z]*\b

Testing it on regex101 shows you will have problems with non-ASCII characters (ö isn't matched by [a-z] and isn't considered a word character unless the UNICODE flag is set).
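The limitation can be reproduced in Java with a small sketch. The class name `AsciiLimitDemo` and the helper `findPhrases` are hypothetical; the pattern is the simplified ASCII one, compiled without any flag. The exact fragments returned may differ between Java versions, because the default treatment of \b next to non-ASCII letters has not always been the same, so the sketch only checks that the phrase containing ö is not found as a whole.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AsciiLimitDemo {
    // Simplified ASCII-only pattern, compiled without any Unicode flag.
    static final Pattern ASCII_PHRASE = Pattern.compile(
        "\\b[a-z]+ [a-z]+\\b|\\b[a-z]+ [A-Z][a-z]*\\b|\\b[A-Z][a-z]*\\b");

    // Hypothetical helper: collects every non-overlapping match, left to right.
    static List<String> findPhrases(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = ASCII_PHRASE.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // "\u00f6" is 'ö', written as an escape so the source file encoding does not matter.
        String input = "leicht bew\u00f6lkt leichter Regen Regen";
        List<String> found = findPhrases(input);
        System.out.println(found);
        // [a-z] cannot match 'ö', so the full two-adjective phrase is never returned.
        System.out.println(found.contains("leicht bew\u00f6lkt")); // prints false
    }
}
```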

To handle the Unicode problem, you can replace your current character classes with the \p{Ll} ("lowercase letters of any language") and \p{Lu} ("uppercase letters of any language") meta-characters, in conjunction with the UNICODE flag, or UNICODE_CHARACTER_CLASS in Java (needed for \b to work correctly):

\b\p{Ll}+ \p{Ll}+\b|\b\p{Ll}+ \p{Lu}\p{L}*\b|\b\p{Lu}\p{Ll}*\b

(regex101, Java code on ideone)
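Since the linked demos are not reproduced here, the following is a minimal Java sketch of the same idea, not the original ideone code. The class name `UnicodePhraseDemo` and the helper `findPhrases` are assumptions; the pattern is the Unicode-aware one given above, compiled with `Pattern.UNICODE_CHARACTER_CLASS`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodePhraseDemo {
    // Unicode-aware pattern; the flag makes \b treat letters such as 'ö' as word characters.
    static final Pattern PHRASE = Pattern.compile(
        "\\b\\p{Ll}+ \\p{Ll}+\\b"            // two adjectives
        + "|\\b\\p{Ll}+ \\p{Lu}\\p{L}*\\b"   // one adjective, one noun
        + "|\\b\\p{Lu}\\p{Ll}*\\b",          // one noun
        Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: collects every non-overlapping match, left to right.
    static List<String> findPhrases(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PHRASE.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // "\u00f6" is 'ö', written as an escape so the source file encoding does not matter.
        List<String> found = findPhrases("leicht bew\u00f6lkt leichter Regen Regen");
        // Expect three phrases: "leicht bewölkt", "leichter Regen", "Regen".
        for (String phrase : found) {
            System.out.println(phrase);
        }
    }
}
```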
