Matching Unicode dashes in a Java regular expression?

Question

I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression:

private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");

which, if I'm reading the Pattern documentation correctly, should capture any of the Unicode dashes or the ASCII dash when surrounded on both sides by whitespace. I'm using the pattern as follows:

String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);

No joy. For the sample input below, the dash is not detected, and titleSegmentSeparator.matcher(sectionTitle).find() returns false!

In order to make sure I wasn't missing any unusual character entities, I used System.out to print some debug information. The output is as follows: each character is followed by the output of (int)char, which should be its Unicode code point, no?

Sample input:

Study Summary (1 of 10) – Competition

S(83)t(116)u(117)d(100)y(121) (32)S(83)u(117)m(109)m(109)a(97)r(114)y(121) (32)((40)1(49) (32)o(111)f(102) (32)1(49)0(48))(41) (32)–(8211) (32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)

It looks to me like that dash is code point 8211, which should be matched by the regex, but it isn't! What's going on here?

Answer 1

You're mixing up decimal (8211) and hexadecimal (0x8211).

\\x and \\u both expect a hexadecimal number, therefore you'd need to use \\u2014 to match the em-dash, not \\u8212 (and \\x2D for the normal hyphen etc.). The en-dash in your sample, decimal 8211, is \\u2013.
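To make the decimal-to-hex step concrete, here is a minimal sketch. The class name DashCodePoints and the helper toRegexEscape are made up for illustration; the pattern is simply the one from the question with each decimal value converted to hex and the alternation folded into a character class.

```java
import java.util.regex.Pattern;

public class DashCodePoints {
    // Hypothetical helper: formats a decimal code point as the backslash-u escape
    // with four hex digits, which is the form a Java regex accepts (BMP only).
    static String toRegexEscape(int codePoint) {
        return String.format("\\u%04X", codePoint);
    }

    // The question's pattern with the decimal values rewritten in hex:
    // 45 -> 2D, 8211 -> 2013, 8212 -> 2014, 8213 -> 2015, 8214 -> 2016.
    static final Pattern EXPLICIT_DASHES =
            Pattern.compile("\\s[\\x2D\\u2013\\u2014\\u2015\\u2016]\\s");

    public static void main(String[] args) {
        int[] decimalCodePoints = {45, 8211, 8212, 8213, 8214};
        for (int cp : decimalCodePoints) {
            System.out.println(cp + " -> " + toRegexEscape(cp));
        }
        // The sample title from the question, with its en-dash written as an escape.
        String title = "Study Summary (1 of 10) \u2013 Competition";
        System.out.println(EXPLICIT_DASHES.matcher(title).find());
    }
}
```

The values printed by (int)char in the question are decimal, so they have to go through a conversion like this before they can be pasted after \\x or \\u.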

But why not simply use the Unicode property "Dash punctuation"?

As a Java string: "\\s\\p{Pd}\\s"
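Plugged into the split from the question, that could look roughly like the sketch below. The class name DashSplit and the method splitTitle are placeholders of mine, not anything from the original code.

```java
import java.util.regex.Pattern;

public class DashSplit {
    // Whitespace, one character from Unicode category Pd (dash punctuation), whitespace.
    private static final Pattern TITLE_SEGMENT_SEPARATOR =
            Pattern.compile("\\s\\p{Pd}\\s");

    // Hypothetical wrapper around Pattern.split() for the title format in the question.
    static String[] splitTitle(String title) {
        return TITLE_SEGMENT_SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        // The sample input from the question; the en-dash is written as an escape.
        String[] parts = splitTitle("Study Summary (1 of 10) \u2013 Competition");
        for (String part : parts) {
            System.out.println(part);
        }
    }
}
```

Since the ASCII hyphen, en-dash and em-dash all belong to category Pd, the same pattern also splits "foo - bar" without listing each code point by hand.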

Answer 1: 12, accepted, 2010-06-15 13:37:43