Java does not treat Polish letters such as ó as word characters (\\w), and I don't know how to write a regex that satisfies all of the following unit tests.
How should I change BEFORE_LANGUAGE and AFTER_LANGUAGE so that the tests pass?
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.junit.Test;

import junit.framework.TestCase;

public class tmpTest extends TestCase {

    final String BEFORE_LANGUAGE = "(?<![\\w\\p{S}])";
    final String AFTER_LANGUAGE = "\\d*((?![\\w\\p{S}])|(<))";

    @Test
    public void test1() {
        // Given:
        String language = ".net";
        String text = "xxxxxxx xxx .net";
        String regex = BEFORE_LANGUAGE + Pattern.quote(language) + AFTER_LANGUAGE;
        // When:
        Matcher m = Pattern.compile(regex).matcher(text);
        // Then:
        assertTrue(m.find());
    }

    @Test
    public void test2() {
        // Given:
        String language = ".net";
        String text = "xxxxxxx xxx .net<br>";
        String regex = BEFORE_LANGUAGE + Pattern.quote(language) + AFTER_LANGUAGE;
        // When:
        Matcher m = Pattern.compile(regex).matcher(text);
        // Then:
        assertTrue(m.find());
    }

    @Test
    public void test3() {
        // Given:
        String language = "c++";
        String text = "xxxxxxx xxx c++";
        String regex = BEFORE_LANGUAGE + Pattern.quote(language) + AFTER_LANGUAGE;
        // When:
        Matcher m = Pattern.compile(regex).matcher(text);
        // Then:
        assertTrue(m.find());
    }

    @Test
    public void test4() {
        // Given:
        String language = "c";
        String text = "xxxxxxx xxx c++";
        String regex = BEFORE_LANGUAGE + Pattern.quote(language) + AFTER_LANGUAGE;
        // When:
        Matcher m = Pattern.compile(regex).matcher(text);
        // Then:
        assertFalse(m.find());
    }

    @Test
    public void test5() {
        // Given:
        String language = "r";
        String text = "xxxxxxx xxx różne";
        String regex = BEFORE_LANGUAGE + Pattern.quote(language) + AFTER_LANGUAGE;
        // When:
        Matcher m = Pattern.compile(regex).matcher(text);
        // Then:
        assertFalse(m.find());
    }

    @Test
    public void test6() {
        // Given:
        String language = "r";
        String text = "xxxxxxx xxx r";
        String regex = BEFORE_LANGUAGE + Pattern.quote(language) + AFTER_LANGUAGE;
        // When:
        Matcher m = Pattern.compile(regex).matcher(text);
        // Then:
        assertTrue(m.find());
    }
}
\\p{IsAlphabetic} matches any character that Unicode considers alphabetic.
By default, \\w also matches the digits 0-9 and the underscore, so you need to put \\d (and _ if you rely on it) in the character class as well.
So, use the following in place of \\w:
[\\p{IsAlphabetic}\\d]
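As a rough sketch of how that character class could be dropped into the question's two lookarounds, the snippet below swaps \\w for an explicit class. The class name AlphabeticClassSketch, the WORD constant, and the containsLanguage helper are hypothetical names introduced only for this illustration, and the underscore is included on the assumption that the original \\w behaviour for _ should be preserved.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; class, constant, and method names are illustrative only.
class AlphabeticClassSketch {

    // Explicit stand-in for \w: Unicode alphabetic characters, digits, underscore.
    static final String WORD = "\\p{IsAlphabetic}\\d_";

    // Same lookarounds as in the question, with \w replaced by WORD.
    static final String BEFORE_LANGUAGE = "(?<![" + WORD + "\\p{S}])";
    static final String AFTER_LANGUAGE = "\\d*((?![" + WORD + "\\p{S}])|(<))";

    // Returns true when the language name occurs in the text as a standalone token.
    static boolean containsLanguage(String language, String text) {
        String regex = BEFORE_LANGUAGE + Pattern.quote(language) + AFTER_LANGUAGE;
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // The Polish word from test5, written with Unicode escapes so the
        // result does not depend on the source file encoding.
        String polishWord = "xxxxxxx xxx r\u00f3\u017cne";
        System.out.println(containsLanguage("r", polishWord));          // false
        System.out.println(containsLanguage("r", "xxxxxxx xxx r"));     // true
        System.out.println(containsLanguage("c", "xxxxxxx xxx c++"));   // false
        System.out.println(containsLanguage("c++", "xxxxxxx xxx c++")); // true
    }
}
```

The pattern rejects "r" in the Polish word because the character right after it is alphabetic in Unicode, so the negative lookahead fails.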
To make \\w and the other shorthand character classes Unicode-aware, pass the Pattern.UNICODE_CHARACTER_CLASS flag when compiling the pattern:
Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS).matcher(text);
With this flag there is no need to rewrite the current pattern.
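To see the effect of the flag in isolation, here is a small sketch that runs the question's unchanged BEFORE_LANGUAGE and AFTER_LANGUAGE against the text from test5, once without and once with Pattern.UNICODE_CHARACTER_CLASS. The class name UnicodeFlagSketch and the containsLanguage helper with its unicodeAware parameter are hypothetical names made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; class and method names are illustrative only.
class UnicodeFlagSketch {

    // Patterns copied unchanged from the question.
    static final String BEFORE_LANGUAGE = "(?<![\\w\\p{S}])";
    static final String AFTER_LANGUAGE = "\\d*((?![\\w\\p{S}])|(<))";

    // Compiles the pattern with or without UNICODE_CHARACTER_CLASS and
    // reports whether the language name is found as a standalone token.
    static boolean containsLanguage(String language, String text, boolean unicodeAware) {
        String regex = BEFORE_LANGUAGE + Pattern.quote(language) + AFTER_LANGUAGE;
        int flags = unicodeAware ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        // The Polish word from test5, written with Unicode escapes so the
        // result does not depend on the source file encoding.
        String polishWord = "xxxxxxx xxx r\u00f3\u017cne";
        // Without the flag \w is ASCII-only, so "r" is wrongly found.
        System.out.println(containsLanguage("r", polishWord, false)); // true
        // With the flag \w covers the accented letter, so there is no match.
        System.out.println(containsLanguage("r", polishWord, true));  // false
    }
}
```

The same behaviour can also be switched on inside the regex itself with the embedded flag (?U), which may be convenient when the Pattern.compile call cannot be changed.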