Matching Unicode Dashes in Java Regular Expressions?
I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression:
private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");
which, if I'm reading the Pattern documentation correctly, should match any of the Unicode dashes or the ASCII dash when surrounded on both sides by whitespace. I'm using the pattern as follows:
String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);
No joy. For the sample input below, the dash is not detected, and titleSegmentSeparator.matcher(sectionTitle).find() returns false!
In order to make sure I wasn't missing any unusual character entities, I used System.out to print some debug information. The output is as follows; each character is followed by the output of (int)char, which should be its Unicode code point, no?
Sample input:
Study Summary (1 of 10) – Competition
S(83)t(116)u(117)d(100)y(121) (32)S(83)u(117)m(109)m(109)a(97)r(114)y(121) (32)((40)1(49) (32)o(111)f(102) (32)1(49)0(48))(41) (32)–(8211) (32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)
It looks to me like that dash is code point 8211, which should be matched by the regex, but it isn't! What's going on here?
You're mixing decimal (8211) and hexadecimal (0x8211).
\\x and \\u both expect a hexadecimal number, therefore you'd need to use \\u2014 to match the em-dash, not \\u8212 (and \\u2013 for the en-dash in your sample, \\x2D for the normal hyphen, etc.).
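As a rough sketch of the decimal-versus-hexadecimal point, the snippet below converts 8211 to hex and builds the separator pattern with hex escapes. The class name DashHexDemo and the helper splitTitle are made up for illustration, and the choice of U+2013, U+2014 and U+2015 is an assumption about which dashes the question intended.

```java
import java.util.regex.Pattern;

class DashHexDemo {
    // Hex escapes: x2D is the ASCII hyphen; U+2013, U+2014 and U+2015 are
    // decimal 8211, 8212 and 8213 (en dash, em dash, horizontal bar).
    static final Pattern SEPARATOR =
            Pattern.compile("\\s(\\x2D|\\u2013|\\u2014|\\u2015)\\s");

    // Hypothetical helper: splits a title on a whitespace-surrounded dash.
    static String[] splitTitle(String title) {
        return SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        // Decimal 8211 written in hexadecimal.
        System.out.println(Integer.toHexString(8211)); // prints 2013

        String[] parts = splitTitle("Study Summary (1 of 10) \u2013 Competition");
        for (String part : parts) {
            System.out.println(part);
        }
    }
}
```

With the sample input from the question, this should yield the two segments "Study Summary (1 of 10)" and "Competition".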
But why not simply use the Unicode property "Dash punctuation"?
As a Java string:
"\\s\\p{Pd}\\s"
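For illustration, a minimal sketch of the Unicode-property approach with Pattern.split(); the class name DashPropertyDemo and the helper splitTitle are hypothetical, not from the original post. Note that \p{Pd} covers only characters in the "Dash punctuation" category, so a look-alike such as the minus sign U+2212 (a math symbol) would not match.

```java
import java.util.regex.Pattern;

class DashPropertyDemo {
    // Any "Dash punctuation" character (ASCII hyphen, en dash, em dash, ...)
    // with one whitespace character on each side.
    static final Pattern SEPARATOR = Pattern.compile("\\s\\p{Pd}\\s");

    // Hypothetical helper: splits "foo - bar" style titles into segments.
    static String[] splitTitle(String title) {
        return SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        String[] titles = {
            "foo - bar",       // ASCII hyphen-minus
            "foo \u2013 bar",  // en dash
            "foo \u2014 bar"   // em dash
        };
        for (String title : titles) {
            System.out.println(String.join(" | ", splitTitle(title)));
        }
    }
}
```

Each of the three sample titles should split into "foo" and "bar".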