
Source encoding of files in Maven Java project

The .java files in our Maven project, which is stored in Subversion, are mostly ASCII, and some files are UTF-8.

I think the intention was that these files would be UTF-8. In the pom file the source encoding is specified as UTF-8.

Now our build fails. Specifically, our SonarQube analysis fails on a .java file which is ISO-8859 and which has a variable with a special character. Using a special character is not a good idea, I think, but that aside, shouldn't the Java files have a consistent (UTF-8) encoding?

Or does it not matter that most are ASCII and only some are UTF-8? It is the thought that counts?

By the way, I don't understand how these files end up with ASCII encoding. When I use an IDE or an editor like SublimeText, files end up as UTF-8.

I only get ASCII when I use Notepad on MS Windows, and Java developers do not typically use that for programming.

Should we change the source files to use UTF-8? Or maybe it doesn't matter and we can leave this as it is?

As an example: on MS Windows I create one file using SublimeText and one file using Notepad.exe. I put the text 1234Ï in both files. The text contains a special character, an I with two dots.

When I look at these files on Linux using file:

ostraaten@io:/tmp/iconv$ file sublimtext.txt 
sublimtext.txt: UTF-8 Unicode (with BOM) text, with no line terminators
ostraaten@io:/tmp/iconv$ file notepad.txt 
notepad.txt: ISO-8859 text, with no line terminators
ostraaten@io:/tmp/iconv$ 

So this shows that Notepad saved the file as ISO-8859 rather than UTF-8. When I check the files using iconv:

ostraaten@io:/tmp/iconv$ iconv -f UTF-8 notepad.txt -o /dev/null 
iconv: incomplete character or shift sequence at end of buffer
ostraaten@io:/tmp/iconv$ iconv -f UTF-8 sublimtext.txt -o /dev/null 
ostraaten@io:/tmp/iconv$ 
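
The same check can be sketched in Java, which may be handy for scanning a source tree from a build step. This is a minimal sketch, not something from the project: the class name `Utf8Check` and the method `isValidUtf8` are made up for illustration. It relies on a strict `CharsetDecoder`, which reports malformed input instead of silently replacing it. The byte 0xCF that Notepad wrote for Ï is a UTF-8 lead byte that expects a continuation byte, which is most likely why iconv complains about an incomplete character at the end of the buffer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (name is an assumption): strict UTF-8 validation.
final class Utf8Check {

    // Returns true only if the bytes decode as UTF-8 without any error.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Usage: java Utf8Check notepad.txt sublimtext.txt
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] bytes = Files.readAllBytes(Paths.get(name));
            System.out.println(name + ": "
                    + (isValidUtf8(bytes) ? "valid UTF-8" : "NOT valid UTF-8"));
        }
    }
}
```

Note that a pure ASCII file also passes this check, since ASCII bytes are valid UTF-8.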

I can open and save the file notepad.txt using SublimeText, but the encoding still shows up as ISO-8859.

The character does display correctly in both files. So this supports the idea that somewhere the editor tries to determine the encoding from the contents of the file. But somewhere else the file is still marked and recognized as ISO-8859.

I can change the encoding using iconv

ostraaten@io:/tmp/iconv$ iconv -f ISO-8859-15 -t UTF-8 notepad.txt > notepad-utf8.txt
ostraaten@io:/tmp/iconv$ file notepad-utf8.txt 
notepad-utf8.txt: UTF-8 Unicode text, with no line terminators
ostraaten@io:/tmp/iconv$ 
ostraaten@io:/tmp/iconv$ iconv -f UTF-8 notepad-utf8.txt -o /dev/null

The conversion was successful, because the incomplete character message is gone.
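
For comparison, the iconv conversion above could be sketched in Java roughly as follows. The class `Recode` and the method `toUtf8` are hypothetical names, and the sketch assumes that the whole file fits in memory and that the source charset is known in advance. `ISO-8859-15` is not one of the charsets every JVM is required to support, so `Charset.forName` may throw on some runtimes; `StandardCharsets.ISO_8859_1` is always available and maps Ï to the same byte.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (name is an assumption): re-encode bytes as UTF-8.
final class Recode {

    // Decodes the input with the given source charset, then encodes it as UTF-8.
    static byte[] toUtf8(byte[] input, Charset sourceCharset) {
        String text = new String(input, sourceCharset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Usage: java Recode notepad.txt notepad-utf8.txt
    public static void main(String[] args) throws IOException {
        // Assumption: the source file is ISO-8859-15, as in the iconv call.
        Charset source = Charset.forName("ISO-8859-15");
        byte[] input = Files.readAllBytes(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), toUtf8(input, source));
    }
}
```

Writing to a separate output file, as the iconv example does, keeps the original around in case the assumed source charset turns out to be wrong.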

Seven-bit ASCII is a subset of UTF-8, so a file that contains only ASCII characters is also valid UTF-8. ISO-8859-1 is Latin-1; its 8-bit bytes (those above 0x7F) are the problematic ones, because on their own they are not valid UTF-8.

So someone worked around UTF-8 with an editor or IDE. Some version control check-ins substitute text back into the source, but that does not seem to apply in your case.

UTF-8 is a solid choice, though it needs some care.
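
A small Java sketch may make the difference concrete. The class name `EncodingBytes` and the helper `hex` are made up for illustration. It prints the bytes of the question's test string in three encodings; the Ï is written as the escape `\u00CF` so the example itself does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo (name is an assumption): compare encoded bytes.
final class EncodingBytes {

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ascii = "1234";
        String mixed = "1234\u00CF"; // 1234Ï

        // Pure ASCII text yields the same bytes in all three encodings.
        System.out.println(hex(ascii.getBytes(StandardCharsets.US_ASCII)));   // 31 32 33 34
        System.out.println(hex(ascii.getBytes(StandardCharsets.ISO_8859_1))); // 31 32 33 34
        System.out.println(hex(ascii.getBytes(StandardCharsets.UTF_8)));      // 31 32 33 34

        // The special character is one byte in Latin-1 but two bytes in UTF-8.
        System.out.println(hex(mixed.getBytes(StandardCharsets.ISO_8859_1))); // 31 32 33 34 CF
        System.out.println(hex(mixed.getBytes(StandardCharsets.UTF_8)));      // 31 32 33 34 C3 8F
    }
}
```

This is why tools report the ASCII-only files as ASCII even if they were saved as UTF-8: without a non-ASCII character or a BOM, the bytes are identical.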
