
Why is non-breaking space not a whitespace character in Java?

While searching for a proper way to trim non-breaking spaces from parsed HTML, I first stumbled on Java's spartan definition of String.trim(), which is at least properly documented. I wanted to avoid explicitly listing the characters eligible for trimming, so I assumed that the Unicode-backed methods on the Character class would do the job for me.

That's when I discovered that Character.isWhitespace(char) explicitly excludes non-breaking spaces:

It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
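The behavior described in that quote can be checked directly. This is a minimal sketch; the class name WhitespaceCheck is made up for illustration, and only standard java.lang calls are used:

```java
public class WhitespaceCheck {
    // The three no-break characters named in the Character.isWhitespace Javadoc.
    static final char[] NO_BREAK_SPACES = {'\u00A0', '\u2007', '\u202F'};

    public static void main(String[] args) {
        for (char c : NO_BREAK_SPACES) {
            // Prints the code point and what isWhitespace reports for it.
            System.out.printf("U+%04X isWhitespace=%b%n", (int) c, Character.isWhitespace(c));
        }
        // An ordinary space, for comparison.
        System.out.println("U+0020 isWhitespace=" + Character.isWhitespace(' '));

        // String.trim() only strips characters up to U+0020, so the
        // no-break spaces on both ends are left in place.
        String padded = "\u00A0x\u00A0";
        System.out.println("length after trim() = " + padded.trim().length());
    }
}
```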

Why is that?

The implementation of the corresponding .NET equivalent is less discriminating.

Character.isWhitespace(char) is old. Really old. Many things done in the early days of Java followed conventions and implementations from C.

Now, more than a decade later, these things seem erroneous. Consider it evidence of how far things have come, even between the first days of Java and the first days of .NET.

Java strives to be 100% backward compatible. So even if the Java team thought it would be good to fix their initial mistake and add non-breaking spaces to the set of characters for which Character.isWhitespace(char) returns true, they can't, because there almost certainly exists software that relies on the current implementation working exactly the way it does.

Since Java 5 there is also an isSpaceChar(int) method. Does that not do what you want?

Determines if the specified character (Unicode code point) is a Unicode space character. A character is considered to be a space character if and only if it is specified to be a space character by the Unicode standard. This method returns true if the character's general category type is any of the following: ...
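One caveat before relying on isSpaceChar(int) alone: because it is based on the Unicode general category, control characters such as tab and line feed do not count as space characters for it. A minimal sketch of combining both tests follows; the class SpaceCharDemo and the helper isAnySpace are hypothetical names, not part of the JDK:

```java
public class SpaceCharDemo {
    /**
     * Hypothetical helper: true if either JDK test accepts the code point.
     * isWhitespace covers tab, line feed and similar controls;
     * isSpaceChar covers the no-break spaces that isWhitespace excludes.
     */
    static boolean isAnySpace(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {' ', '\t', 0x00A0, 0x2007, 0x202F, 'x'};
        for (int cp : samples) {
            System.out.printf("U+%04X isSpaceChar=%b isWhitespace=%b isAnySpace=%b%n",
                    cp, Character.isSpaceChar(cp), Character.isWhitespace(cp), isAnySpace(cp));
        }
    }
}
```

Whether the union of the two tests is the right definition depends on the input; for text parsed from HTML it is probably closer to what the question is after than either test on its own.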

As posted above, isSpaceChar(int) will put the OP on track to the answer. It seems fairly discreetly documented, but this method is actually usable with regexes. So:

    "X\u00A0X X".replaceAll("\\p{javaSpaceChar}", "_");

will produce an "X_X_X" string. It is left as an exercise for the reader to come up with the regex to trim a string. (Pattern with some flags should do the trick.)
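For what it's worth, one possible solution to that exercise is sketched below. It is an assumption about what "trim" should mean here: it strips anything accepted by either isSpaceChar or isWhitespace from both ends, and anchors with \A and \z rather than flags. The class name UnicodeTrim is made up for illustration:

```java
import java.util.regex.Pattern;

public class UnicodeTrim {
    // Character class covering both JDK notions of "space".
    private static final String SPACE = "[\\p{javaSpaceChar}\\p{javaWhitespace}]";

    // A run of spaces at the very start (\A) or the very end (\z) of the input.
    private static final Pattern EDGE_SPACES =
            Pattern.compile("\\A" + SPACE + "+|" + SPACE + "+\\z");

    /** Removes leading and trailing spaces, including no-break spaces. */
    static String trim(String s) {
        return EDGE_SPACES.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String raw = "\u00A0 foo\u00A0bar\t\u00A0";
        // Interior no-break space is kept; only the edges are stripped.
        System.out.println("[" + trim(raw) + "]");
    }
}
```

\z is used instead of $ because $ can also match just before a final line terminator, which is not what a trim wants.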

I would argue that Java's implementation is more correct than .NET's. The non-breaking space is essentially a non-whitespace character that looks like one. That is, if you have the strings "foo" and "bar" and put any traditional whitespace character between them, you get a word break. A non-breaking space, however, does not break the two up.

The only time a non-breaking space should be treated specially is in code designed to perform word-wrapping of text.

For all other purposes, including word counts, trimming, and general-purpose splitting along word boundaries, a non-breaking space is still whitespace.
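To make the word-count case concrete, here is a small sketch comparing the default \s class with a class that also accepts no-break spaces. The class name WordSplit and the helper words are hypothetical:

```java
import java.util.regex.Pattern;

public class WordSplit {
    // Accepts anything either isSpaceChar or isWhitespace treats as a space.
    private static final Pattern ANY_SPACE =
            Pattern.compile("[\\p{javaSpaceChar}\\p{javaWhitespace}]+");

    /** Splits text into words, treating no-break spaces as separators too. */
    static String[] words(String text) {
        return ANY_SPACE.split(text);
    }

    public static void main(String[] args) {
        String text = "foo\u00A0bar baz";
        // Default \s does not match U+00A0, so "foo" and "bar" stay glued together.
        System.out.println("\\s+ split:      " + text.split("\\s+").length + " parts");
        System.out.println("any-space split: " + words(text).length + " parts");
    }
}
```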

Any argument that a non-breaking space just "looks like" a space but isn't one conflicts with the whole point of Unicode, which represents characters based on their meaning, not how they are displayed.

Thus, IMHO, the Java implementation of String.trim() is not performing as expected, and the underlying Character.isWhitespace() function is at fault.

My guess is that the Java implementors wrote isWhitespace() based on the need to perform text-wrapping within controls. They should have named this function isWordWrappingBoundary() or something clearer, and used a less restrictive whitespace test for trim().

It looks like the method name (isWhitespace) is inconsistent with its function (to detect separators). The "separator" functionality is fairly clear if you look at the full list of characters from the Javadoc page you quoted:

* It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
* It is '\u0009', HORIZONTAL TABULATION.
* It is '\u000A', LINE FEED.
* It is '\u000B', VERTICAL TABULATION.
* It is '\u000C', FORM FEED.
* It is '\u000D', CARRIAGE RETURN.
* It is '\u001C', FILE SEPARATOR.
* It is '\u001D', GROUP SEPARATOR.
* It is '\u001E', RECORD SEPARATOR.
* It is '\u001F', UNIT SEPARATOR. 

A non-breaking space's function is supposed to be a visual space between words that is not broken up by hyphenation algorithms.

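Also be careful when using the Apache Commons function StringUtils.isBlank() (and related functions), which has the same odd isWhitespace behavior, i.e. non-breaking spaces are considered non-whitespace.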
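If a blank check that also accepts non-breaking spaces is needed, a small helper can be written with the JDK alone. This is only a sketch: BlankCheck and isBlankUnicode are hypothetical names, not part of Apache Commons, and treating null as blank is an assumption made here:

```java
public class BlankCheck {
    /**
     * Hypothetical helper: true if the text is null, empty, or consists only of
     * code points accepted by Character.isWhitespace or Character.isSpaceChar.
     */
    static boolean isBlankUnicode(CharSequence text) {
        if (text == null) {
            return true; // assumption: null counts as blank
        }
        return text.codePoints()
                .allMatch(cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        System.out.println(isBlankUnicode("\u00A0 \t"));
        System.out.println(isBlankUnicode("\u00A0x"));
    }
}
```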
