What can cause Java compiler to fail while parsing a comment?

The following code is a valid Java program.

public class Foo
{
    public static void \u006d\u0061\u0069\u006e(String[] args)
    {
        System.out.println("hello, world");
    }
}

The main identifier is written using Unicode escape sequences. It compiles and runs fine.

$ javac Foo.java && java Foo
hello, world

Although the following details may not be necessary for this question, I am sharing them in case anyone is curious. I am using the Java compiler from OpenJDK on Debian 8.0, but what I ask in this question should apply to any Java compiler.

$ javac -version
javac 1.7.0_79
$ readlink -f $(which javac)
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac

The following program fails to compile because the escape sequence used to write the m of main is invalid.

public class Foo
{
    public static void \u6d\u0061\u0069\u006e(String[] args)
    {
        System.out.println("hello, world");
    }
}

The compiler complains about an illegal Unicode escape.

$ javac Foo.java && java Foo
Foo.java:3: error: illegal unicode escape
    public static void \u6d\u0061\u0069\u006e(String[] args)
                           ^
Foo.java:3: error: invalid method declaration; return type required
    public static void \u6d\u0061\u0069\u006e(String[] args)
                            ^
2 errors

What surprised me is that the following program is also invalid, even though the illegal Unicode escape sequence appears to be inside a comment.

public class Foo
{
    // This comment contains \u6d.
    public static void main(String[] args)
    {
        System.out.println("hello, world");
    }
}

Here is the error.

$ javac Foo.java && java Foo
Foo.java:3: error: illegal unicode escape
    // This comment contains \u6d.
                                 ^
1 error

The compiler complains about the illegal Unicode escape sequence even though it appears to be inside a comment.

The reason for this behaviour becomes clear when we look at how an end-of-line comment is defined in JLS §3.7.

EndOfLineComment:
/ / {InputCharacter} 

JLS §3.4 defines InputCharacter as follows.

InputCharacter:
  UnicodeInputCharacter but not CR or LF 

Finally, JLS §3.3 defines UnicodeInputCharacter as follows.

UnicodeInputCharacter:
  UnicodeEscape
  RawInputCharacter

UnicodeEscape:
  \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

UnicodeMarker:
  u {u}

HexDigit:
  (one of)
  0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

RawInputCharacter:
  any Unicode character

Therefore, the lexical analyzer must recognize Unicode escape sequences before it can recognize comments. If it finds an illegal Unicode escape sequence, lexical analysis fails with an error, so the compiler never gets as far as recognizing the comment that contains it.

I used to think that everything from the start of a comment (say //) to its end is ignored. The example above shows that this is not the case: the lexical analyzer has to recognize Unicode escape sequences between the start and the end of a comment, and an illegal one causes lexical analysis to fail.

What else can cause the compiler to fail while parsing a comment?

Short:

Nothing else.

Long:

Logically, the \u escape sequences are handled before lexical processing (scanning/tokenizing) takes place. According to https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.2:

A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:

  1. A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.

  2. A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators (§3.4).

  3. A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of input elements (§3.5) which, after white space (§3.6) and comments (§3.7) are discarded, comprise the tokens (§3.5) that are the terminal symbols of the syntactic grammar (§2.3).
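To make step 1 concrete, here is a minimal sketch of such a translation pass. It is only an illustration of the rules in §3.3, not javac's actual code: the names UnicodeEscapeSketch, translate and hexValue are made up for this example, and reporting the failure as an IllegalArgumentException is my own choice. The point is that this pass sees only raw characters and has no idea whether it is inside a comment.

```java
// A simplified sketch of lexical translation step 1 (JLS 3.3); not javac's actual code.
// Class and method names are hypothetical.
class UnicodeEscapeSketch {

    // Value of an ASCII hex digit, or -1 if c is not one (HexDigit in the grammar).
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Replaces every well-formed Unicode escape in raw with the UTF-16 code unit
    // it denotes; throws if an eligible backslash followed by 'u' does not lead
    // to four hex digits.
    static String translate(String raw) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0; // contiguous backslashes immediately before index i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            // Only a backslash preceded by an even number of backslashes is eligible.
            boolean eligible = c == '\\' && backslashes % 2 == 0;
            if (eligible && i + 1 < raw.length() && raw.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') {
                    j++; // UnicodeMarker: u {u}
                }
                int value = 0;
                for (int k = j; k < j + 4; k++) {
                    int digit = k < raw.length() ? hexValue(raw.charAt(k)) : -1;
                    if (digit < 0) {
                        throw new IllegalArgumentException("illegal unicode escape at index " + i);
                    }
                    value = value * 16 + digit;
                }
                out.append((char) value);
                i = j + 4;
                backslashes = 0;
            } else {
                out.append(c);
                backslashes = (c == '\\') ? backslashes + 1 : 0;
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Backslashes are doubled in these string literals so that the Java
        // compiler itself leaves them alone; translate() sees a single backslash.
        System.out.println(translate("void \\u006d\\u0061\\u0069\\u006e()"));
        try {
            translate("// This comment contains \\u6d.");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The first call yields void main(), and the second throws even though the input starts with //, because comments are not recognized until step 3.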

So technically, \u6d in your example is NOT yet part of the comment. Whether or not it belongs to that comment can only be determined after it has been translated to the corresponding Unicode character, and unfortunately that translation is exactly where it fails.

As a demonstration, the following class should compile:

public class Test {
    // is comment, the rest, not\u000a public static void main( String[] args) {
        System.out.println("See!");
    }
}
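A variation that checks itself at run time may make this easier to see. This is only a sketch: CommentEscapeDemo and counter are names I made up, and it assumes a compiler that follows the rule in §3.3 that a backslash preceded by an odd number of backslashes does not start a Unicode escape.

```java
class CommentEscapeDemo {
    static int counter() {
        int n = 0;
        // The escape ends this comment, so the increment is live code: \u000a n++;
        // The doubled backslash is not an escape, so this one stays inert: \\u000a n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(counter());
    }
}
```

This prints 1: only the first n++ is executed. The second line also shows a way to mention a backslash followed by u inside a comment without triggering the translation at all.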
