Regular expressions: how to make Java treat Polish letters as normal \w?

Java does not treat Polish letters such as ó as word characters ( \\w ). I don't know how to write a regex that satisfies all of the following unit tests.

How should BEFORE_LANGUAGE and AFTER_LANGUAGE be changed so that all the tests pass?

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.junit.Test;

import junit.framework.TestCase;

public class tmpTest extends TestCase{

    final String BEFORE_LANGUAGE = "(?<![\\w\\p{S}])";
    final String AFTER_LANGUAGE = "\\d*((?![\\w\\p{S}])|(<))";


    @Test
    public void test1() {
        // Given:
        String language = ".net";
        String text = "xxxxxxx xxx .net";
        String regex = BEFORE_LANGUAGE + Pattern.quote(language) + AFTER_LANGUAGE;

        // When: 
        Matcher m = Pattern.compile(regex).matcher(text);

        // Then:
        assertTrue(m.find());
    }

    @Test
    public void test2() {
        // Given:
        String language = ".net";
        String text = "xxxxxxx xxx .net<br>";
        String regex = BEFORE_LANGUAGE + Pattern.quote(language) + AFTER_LANGUAGE;

        // When: 
        Matcher m = Pattern.compile(regex).matcher(text);

        // Then:
        assertTrue(m.find());
    }

    @Test
    public void test3() {
        // Given:
        String language = "c++";
        String text = "xxxxxxx xxx c++";
        String regex = BEFORE_LANGUAGE + Pattern.quote(language) + AFTER_LANGUAGE;

        // When: 
        Matcher m = Pattern.compile(regex).matcher(text);

        // Then:
        assertTrue(m.find());
    }

    @Test
    public void test4() {
        // Given:
        String language = "c";
        String text = "xxxxxxx xxx c++";
        String regex = BEFORE_LANGUAGE + Pattern.quote(language) + AFTER_LANGUAGE;

        // When: 
        Matcher m = Pattern.compile(regex).matcher(text);

        // Then:
        assertFalse(m.find());
    }

    @Test
    public void test5() {
        // Given:
        String language = "r";
        String text = "xxxxxxx xxx różne";
        String regex = BEFORE_LANGUAGE + Pattern.quote(language) + AFTER_LANGUAGE;

        // When: 
        Matcher m = Pattern.compile(regex).matcher(text);

        // Then:
        assertFalse(m.find());
    }

    @Test
    public void test6() {
        // Given:
        String language = "r";
        String text = "xxxxxxx xxx r";
        String regex = BEFORE_LANGUAGE + Pattern.quote(language) + AFTER_LANGUAGE;

        // When: 
        Matcher m = Pattern.compile(regex).matcher(text);

        // Then:
        assertTrue(m.find());
    }

}

In Java, \\p{IsAlphabetic} matches any character that Unicode classifies as alphabetic, which includes Polish letters such as ó.

\\w also matches the digits 0-9 , so you need to put \\d in the character class as well (and _ if you want an exact stand-in for \\w , which matches the underscore too).

So,

[\p{IsAlphabetic}\d]
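As a rough sketch of how that class could be dropped into the lookarounds from the question: the class name UnicodeBoundaryDemo, the WORD constant and the mentions helper are made up for illustration, and the underscore is an addition here so the class mirrors \w more closely. The Polish letters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

public class UnicodeBoundaryDemo {
    // Hypothetical replacement for \w: Unicode letters, ASCII digits, underscore.
    static final String WORD = "\\p{IsAlphabetic}\\d_";

    // Same lookarounds as in the question, with \w swapped for WORD.
    static final String BEFORE_LANGUAGE = "(?<![" + WORD + "\\p{S}])";
    static final String AFTER_LANGUAGE = "\\d*((?![" + WORD + "\\p{S}])|(<))";

    // Illustrative helper: true if the language name occurs as a standalone token.
    static boolean mentions(String language, String text) {
        String regex = BEFORE_LANGUAGE + Pattern.quote(language) + AFTER_LANGUAGE;
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // "r\u00f3\u017cne" is the word "różne" from test5.
        System.out.println(mentions("r", "xxxxxxx xxx r\u00f3\u017cne")); // false
        System.out.println(mentions("r", "xxxxxxx xxx r"));               // true
        System.out.println(mentions("c", "xxxxxxx xxx c++"));             // false
        System.out.println(mentions("c++", "xxxxxxx xxx c++"));           // true
        System.out.println(mentions(".net", "xxxxxxx xxx .net<br>"));     // true
    }
}
```

With this class, the r in "różne" is followed by an alphabetic character, so the negative lookahead rejects the match, which is what test5 expects.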

To make \\w and the other shorthand character classes Unicode-aware, pass the Pattern.UNICODE_CHARACTER_CLASS flag when compiling the pattern:

Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS).matcher(text);

No need to rewrite the current pattern.
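A small side-by-side sketch of the flag's effect on the case from test5, keeping the question's constants unchanged. The class name UnicodeFlagDemo and the mentions helper are illustrative only.

```java
import java.util.regex.Pattern;

public class UnicodeFlagDemo {
    // Constants copied unchanged from the question.
    static final String BEFORE_LANGUAGE = "(?<![\\w\\p{S}])";
    static final String AFTER_LANGUAGE = "\\d*((?![\\w\\p{S}])|(<))";

    // Illustrative helper: compiles the pattern with the given flags and searches the text.
    static boolean mentions(String language, String text, int flags) {
        String regex = BEFORE_LANGUAGE + Pattern.quote(language) + AFTER_LANGUAGE;
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "xxxxxxx xxx r\u00f3\u017cne"; // "różne", as in test5

        // Default: \w is ASCII-only, so ó is not a word character and "r" is wrongly found.
        System.out.println(mentions("r", text, 0)); // true

        // With the flag, \w covers Unicode letters, so "r" is seen as part of a longer word.
        System.out.println(mentions("r", text, Pattern.UNICODE_CHARACTER_CLASS)); // false
    }
}
```

The same behaviour can be switched on inside the regex itself by prefixing it with the embedded flag (?U), which is handy when the call to Pattern.compile cannot be changed.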
