
Why is a non-breaking space not a whitespace character in Java?

While searching for a proper way to trim non-breaking spaces from parsed HTML, I first stumbled on Java's spartan definition of String.trim(), which is at least properly documented. I wanted to avoid explicitly listing the characters eligible for trimming, so I assumed that the Unicode-backed methods on the Character class would do the job for me.

That's when I discovered that Character.isWhitespace(char) explicitly excludes non-breaking spaces:

It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').

Why is that?

The implementation of corresponding .NET equivalent is less discriminating.

Character.isWhitespace(char) is old. Really old. Many things done in the early days of Java followed conventions and implementations from C.

Now, more than a decade later, these things seem erroneous. Consider it evidence of how far things have come, even between the first days of Java and the first days of .NET.

Java strives to be 100% backward compatible. So even if the Java team thought it would be good to fix their initial mistake and add non-breaking spaces to the set of characters that returns true from Character.isWhitespace(char), they can't, because there almost certainly exists software that relies on the current implementation working exactly the way it does.

Since Java 5 there is also an isSpaceChar(int) method. Does that not do what you want?

Determines if the specified character (Unicode code point) is a Unicode space character. A character is considered to be a space character if and only if it is specified to be a space character by the Unicode standard. This method returns true if the character's general category type is any of the following: ...
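To make the difference concrete, here is a minimal sketch that runs both methods on a few code points, including the three non-breaking spaces named in the Javadoc. The class name SpaceCheckDemo and the helper classify are made up for illustration; only Character.isWhitespace(int) and Character.isSpaceChar(int) come from the JDK.

```java
class SpaceCheckDemo {
    /** Describes how both Character methods classify one code point. */
    static String classify(int codePoint) {
        return String.format("U+%04X isWhitespace=%b isSpaceChar=%b",
                codePoint,
                Character.isWhitespace(codePoint),
                Character.isSpaceChar(codePoint));
    }

    public static void main(String[] args) {
        // SPACE, NO-BREAK SPACE, FIGURE SPACE, NARROW NO-BREAK SPACE, TAB
        int[] samples = {0x0020, 0x00A0, 0x2007, 0x202F, 0x0009};
        for (int cp : samples) {
            System.out.println(classify(cp));
        }
    }
}
```

The three non-breaking spaces should come out as isWhitespace=false and isSpaceChar=true, while the tab is the other way around, so neither method alone covers everything one might want to trim.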

As posted above, isSpaceChar(int) will put the OP on track to the answer. It is not prominently documented, but this method is actually usable in regexes. So:

    "X\u00A0X X".replaceAll("\\p{javaSpaceChar}", "_");

will produce a "X_X_X" string. It is left as an exercise for the reader to come up with the regex to trim a string. (Pattern with some flags should do the trick.)
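For readers who would rather not do the exercise, here is one possible sketch. It uses the anchors \A and \z instead of flags, and it combines \p{javaWhitespace} with \p{javaSpaceChar} so that tabs and line feeds are trimmed along with non-breaking spaces. The class NbspTrim and the method trimUnicode are hypothetical names, not part of any library.

```java
import java.util.regex.Pattern;

class NbspTrim {
    // Matches a leading or trailing run of characters accepted by either
    // Character.isWhitespace or Character.isSpaceChar.
    private static final Pattern EDGE_SPACE = Pattern.compile(
            "\\A[\\p{javaWhitespace}\\p{javaSpaceChar}]+"
            + "|[\\p{javaWhitespace}\\p{javaSpaceChar}]+\\z");

    /** Hypothetical helper: strips whitespace and space characters from both ends. */
    static String trimUnicode(String s) {
        return EDGE_SPACE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // NBSP, space and tab in front; space and NBSP at the end.
        String parsed = "\u00A0 \tfoo\u00A0bar \u00A0";
        // The inner NBSP between foo and bar is left alone.
        System.out.println("[" + trimUnicode(parsed) + "]");
    }
}
```

Precompiling the pattern avoids recompiling it on every call, which matters if this runs over many parsed HTML fragments.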

I would argue that Java's implementation is more correct than .NET's. The non-breaking space is essentially a non-whitespace character that looks like one. That is, if you have the strings "foo" and "bar", and put any traditional whitespace character in between them, you would get a word break. A non-breaking space, however, does not break the two up.

The only time a non-breaking space should be treated specially is with code designed to perform word-wrapping of text.

For all other purposes, including word counts, trimming, and general-purpose splitting along word boundaries, a non-breaking space is still whitespace.

Any argument that a non-breaking space just "looks like" a space but isn't one conflicts with the whole point of Unicode, which represents characters based on their meaning, not how they are displayed.

Thus, IMHO, the Java implementation of String.trim() is not performing as expected, and the underlying Character.isWhitespace() function is at fault.

My guess is that the Java implementors wrote isWhitespace() based on the need to perform text-wrapping within controls. They should have named this function isWordWrappingBoundary() or something more clear, and used a less-restrictive whitespace test for trim().

It looks like the method name (isWhitespace) is inconsistent with its function (to detect separators). The "separator" functionality is fairly clear if you look at the full list of characters from the Javadoc page you quoted:

* It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
* It is '\u0009', HORIZONTAL TABULATION.
* It is '\u000A', LINE FEED.
* It is '\u000B', VERTICAL TABULATION.
* It is '\u000C', FORM FEED.
* It is '\u000D', CARRIAGE RETURN.
* It is '\u001C', FILE SEPARATOR.
* It is '\u001D', GROUP SEPARATOR.
* It is '\u001E', RECORD SEPARATOR.
* It is '\u001F', UNIT SEPARATOR. 

A non-breaking space is meant to provide visual space between words without letting line-breaking or hyphenation algorithms separate them.
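The Javadoc list can be cross-checked against a running JVM with a small scan. The sketch below walks the Basic Multilingual Plane and collects every code point on which isWhitespace and isSpaceChar disagree; WhitespaceDiff and disagreements are hypothetical names. I would expect the output to contain the control separators from the list plus the three non-breaking spaces, but the exact result depends on the Unicode tables shipped with your JDK.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceDiff {
    /** Collects BMP code points on which the two Character methods disagree. */
    static List<Integer> disagreements() {
        List<Integer> result = new ArrayList<>();
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (Character.isWhitespace(cp) != Character.isSpaceChar(cp)) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int cp : disagreements()) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
                    cp, Character.isWhitespace(cp), Character.isSpaceChar(cp));
        }
    }
}
```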

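Also be careful when using the Apache Commons function StringUtils.isBlank() (and related functions), which has the same odd isWhitespace behavior: a non-breaking space is considered non-whitespace.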
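If that behavior of StringUtils.isBlank() is a problem, a plain-JDK alternative is easy to sketch. The helper below is hypothetical (BlankCheck and isBlankUnicode are not part of Apache Commons or the JDK) and assumes Java 8 or later for CharSequence.codePoints(). It treats a string as blank when every code point passes either isWhitespace or isSpaceChar.

```java
class BlankCheck {
    /**
     * Hypothetical helper: true if s is null, empty, or consists only of
     * characters accepted by Character.isWhitespace or Character.isSpaceChar.
     */
    static boolean isBlankUnicode(CharSequence s) {
        if (s == null) {
            return true;
        }
        // allMatch on an empty stream is true, so "" counts as blank.
        return s.codePoints()
                .allMatch(cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        System.out.println(isBlankUnicode("\u00A0\t "));
        System.out.println(isBlankUnicode("\u00A0x"));
    }
}
```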
