Background
I am trying to read clipboard data in the HTML data flavor using Java. So I copy content to the clipboard from browsers and then use java.awt.datatransfer.Clipboard to get it.
This works properly on Windows systems. But on Ubuntu there are some strange issues. The worst case is when the data was copied to the clipboard from the Firefox browser.
Example for reproducing the behavior
Java code:
import java.io.*;
import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.DataFlavor;

public class WorkingWithClipboadData {

 static void doSomethingWithBytesFromClipboard(byte[] dataBytes, String paramCharset, int number) throws Exception {
  String fileName = "Result " + number + " " + paramCharset + ".txt";
  OutputStream fileOut = new FileOutputStream(fileName);
  fileOut.write(dataBytes, 0, dataBytes.length);
  fileOut.close();
 }

 public static void main(String[] args) throws Exception {

  Clipboard clipboard = Toolkit.getDefaultToolkit().getSystemClipboard();

  int count = 0;

  for (DataFlavor dataFlavor : clipboard.getAvailableDataFlavors()) {
   System.out.println(dataFlavor);
   String mimeType = dataFlavor.getHumanPresentableName();
   if ("text/html".equalsIgnoreCase(mimeType)) {
    String paramClass = dataFlavor.getParameter("class");
    if ("java.io.InputStream".equals(paramClass)) {
     String paramCharset = dataFlavor.getParameter("charset");
     if (paramCharset != null && paramCharset.startsWith("UTF")) {
      System.out.println("============================================");
      System.out.println(paramCharset);
      System.out.println("============================================");
      InputStream inputStream = (InputStream)clipboard.getData(dataFlavor);
      ByteArrayOutputStream data = new ByteArrayOutputStream();
      byte[] buffer = new byte[1024];
      int length = -1;
      while ((length = inputStream.read(buffer)) != -1) {
       data.write(buffer, 0, length);
      }
      data.flush();
      inputStream.close();
      byte[] dataBytes = data.toByteArray();
      data.close();
      doSomethingWithBytesFromClipboard(dataBytes, paramCharset, ++count);
     }
    }
   }
  }
 }
}
Problem description
What I am doing is: I open the URL https://en.wikipedia.org/wiki/Germanic_umlaut in Firefox, select "letters: ä" there and copy this to the clipboard. Then I run my Java program. After that, the resulting files (only some of them, as examples) look like this:
axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 1 UTF-16.txt"
00000000: feff fffd fffd 006c 0000 0065 0000 0074 .......l...e...t
00000010: 0000 0074 0000 0065 0000 0072 0000 0073 ...t...e...r...s
00000020: 0000 003a 0000 0020 0000 003c 0000 0069 ...:... ...<...i
00000030: 0000 003e 0000 fffd 0000 003c 0000 002f ...>.......<.../
00000040: 0000 0069 0000 003e 0000 ...i...>..
OK, the FEFF at the start looks like a UTF-16BE byte order mark. But what is the FFFD? And why are there those 0000 bytes between the single letters? The UTF-16 encoding of l is 006C only. It seems as if all letters are encoded in 32 bits, but that is wrong for UTF-16. And all non-ASCII characters are encoded as FFFD 0000 and so are lost.
axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 4 UTF-8.txt"
00000000: efbf bdef bfbd 6c00 6500 7400 7400 6500 ......l.e.t.t.e.
00000010: 7200 7300 3a00 2000 3c00 6900 3e00 efbf r.s.:. .<.i.>...
00000020: bd00 3c00 2f00 6900 3e00 ..<./.i.>.
Here the EFBF BDEF BFBD does not look like any known byte order mark. And all letters seem to be encoded in 16 bits, double the bits needed for UTF-8. So the number of bits used always seems to be double what is needed; see the UTF-16 example above. And all non-ASCII letters are encoded as EFBFBD and so are also lost.
axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 7 UTF-16BE.txt"
00000000: fffd fffd 006c 0000 0065 0000 0074 0000 .....l...e...t..
00000010: 0074 0000 0065 0000 0072 0000 0073 0000 .t...e...r...s..
00000020: 003a 0000 0020 0000 003c 0000 0069 0000 .:... ...<...i..
00000030: 003e 0000 fffd 0000 003c 0000 002f 0000 .>.......<.../..
00000040: 0069 0000 003e 0000 .i...>..
Same picture as in the examples above: all letters are encoded using 32 bits, while only 16 bits should be used in UTF-16, except for supplementary characters, which use surrogate pairs. And all non-ASCII letters are encoded as FFFD 0000 and so are lost.
axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 10 UTF-16LE.txt"
00000000: fdff fdff 6c00 0000 6500 0000 7400 0000 ....l...e...t...
00000010: 7400 0000 6500 0000 7200 0000 7300 0000 t...e...r...s...
00000020: 3a00 0000 2000 0000 3c00 0000 6900 0000 :... ...<...i...
00000030: 3e00 0000 fdff 0000 3c00 0000 2f00 0000 >.......<.../...
00000040: 6900 0000 3e00 0000 i...>...
Only for completeness: same picture as above.
So the conclusion is that the Ubuntu clipboard content is totally messed up after copying something into it from Firefox, at least for the HTML data flavors and when reading the clipboard using Java.
Other browser used
When I do the same using the Chromium browser as the source of the data, the problems become smaller.
So I open the URL https://en.wikipedia.org/wiki/Germanic_umlaut in Chromium, select "letters: ä" there, copy this to the clipboard and run my Java program.
The result looks like:
axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 1 UTF-16.txt"
00000000: feff 003c 006d 0065 0074 0061 0020 0068 ...<.m.e.t.a. .h
...
00000800: 0061 006c 003b 0022 003e 00e4 003c 002f .a.l.;.".>...<./
00000810: 0069 003e 0000 .i.>..
Chromium puts more HTML around the selection in the HTML data flavors in the clipboard. But the encoding looks proper, also for the non-ASCII ä = 00E4. There is one small problem, though: there are additional 0000 bytes at the end which should not be there. In UTF-16 there are 2 additional 00 bytes at the end.
axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 4 UTF-8.txt"
00000000: 3c6d 6574 6120 6874 7470 2d65 7175 6976 <meta http-equiv
...
000003f0: 696f 6e2d 636f 6c6f 723a 2069 6e69 7469 ion-color: initi
00000400: 616c 3b22 3ec3 a43c 2f69 3e00 al;">..</i>.
Same as above: the encoding looks proper for UTF-8, but here too there is one additional 00 byte at the end which should not be there.
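If only those trailing NUL bytes are the problem (as with Chromium as the source), trimming them is simple. A minimal helper sketch (the class and method names are just for illustration); note it is only safe for UTF-8 bytes, since in UTF-16LE a legitimate final character such as > = 3E 00 ends in a 00 byte:

```java
import java.util.Arrays;

public class TrimTrailingNuls {

 // Drop any trailing 0x00 bytes from UTF-8 clipboard data.
 // For the UTF-16 flavors, bytes would have to be trimmed in pairs instead.
 static byte[] stripTrailingNuls(byte[] data) {
  int end = data.length;
  while (end > 0 && data[end - 1] == 0x00) {
   end--;
  }
  return Arrays.copyOf(data, end);
 }

 public static void main(String[] args) {
  byte[] utf8WithNul = {0x3c, 0x2f, 0x69, 0x3e, 0x00}; // "</i>" plus stray NUL
  System.out.println(stripTrailingNuls(utf8WithNul).length); // prints 4
 }
}
```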
Environment
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
Mozilla Firefox 61.0.1 (64-Bit)
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
Questions
Am I doing something wrong in my code?
Can someone advise how to avoid this messed-up content in the clipboard? Since the non-ASCII characters are lost, at least when copied from Firefox, I don't think this content can be repaired.
Is this a known issue? Can someone confirm the same behavior? If so, is there already a bug report about this in Firefox?
Or is this a problem that only occurs when Java code reads the clipboard content? It seems so: if I copy content from Firefox and paste it into LibreOffice Writer, the Unicode appears properly. And if I then copy content from Writer to the clipboard and read it with my Java program, the UTF encodings are correct except for the additional 00 bytes at the end. So clipboard content copied from Writer behaves like content copied from the Chromium browser.
New insights
The bytes 0xFFFD seem to be the Unicode character REPLACEMENT CHARACTER (U+FFFD). So 0xFDFF is the little-endian representation of it and 0xEFBFBD is its UTF-8 encoding. So all results seem to be the outcome of wrongly decoding and re-encoding Unicode.
It seems the clipboard content coming from Firefox is always UTF-16LE with BOM. But Java then takes this as UTF-8. So the 2-byte BOM becomes two garbage characters, which are replaced by 0xEFBFBD; each additional 0x00 byte becomes its own NUL character; and every byte sequence which is not a proper UTF-8 byte sequence becomes a garbage character, which is replaced by 0xEFBFBD. Then this pseudo UTF-8 is re-encoded. Now the garbage is complete.
Example:
The sequence aɛaüa in UTF-16LE with BOM is 0xFFFE 6100 5B02 6100 FC00 6100.
Taken as UTF-8 (with 0xEFBFBD standing in for every invalid UTF-8 byte sequence), this becomes: 0xEFBFBD 0xEFBFBD a NUL [ STX a NUL 0xEFBFBD NUL a NUL.
This mangled text re-encoded to UTF-16LE is: 0xFDFF FDFF 6100 0000 5B00 0200 6100 0000 FDFF 0000 6100 0000.
This mangled text re-encoded to UTF-8 is: 0xEFBF BDEF BFBD 6100 5B02 6100 EFBF BD00 6100.
And this is exactly what happens.
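This whole round trip can be reproduced in plain Java, independent of the clipboard. A small demo (class and helper names are only for illustration); the re-encoded hex output matches the Result dumps above:

```java
import java.nio.charset.StandardCharsets;

public class EncodingMangleDemo {

 // render a byte array as lowercase hex
 static String hex(byte[] bytes) {
  StringBuilder sb = new StringBuilder();
  for (byte b : bytes) sb.append(String.format("%02x", b));
  return sb.toString();
 }

 public static void main(String[] args) {
  // "aɛaüa" with a prepended BOM, encoded UTF-16LE as Firefox provides it
  byte[] utf16le = ("\uFEFF" + "a\u025Ba\u00FCa").getBytes(StandardCharsets.UTF_16LE);
  System.out.println(hex(utf16le)); // fffe61005b026100fc006100

  // decode those bytes as UTF-8: every invalid sequence becomes U+FFFD,
  // every 0x00 high byte becomes its own NUL character
  String mangled = new String(utf16le, StandardCharsets.UTF_8);

  // re-encode the mangled string, as the clipboard code does
  System.out.println(hex(mangled.getBytes(StandardCharsets.UTF_16LE)));
  // fdfffdff610000005b00020061000000fdff000061000000
  System.out.println(hex(mangled.getBytes(StandardCharsets.UTF_8)));
  // efbfbdefbfbd61005b026100efbfbd006100
 }
}
```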
Other examples:
Â = 0x00C2 = C200 in UTF-16LE = 0xEFBFBD00 in pseudo UTF-8
胂 = 0x80C2 = C280 in UTF-16LE = 0xC280 in pseudo UTF-8
So I think Firefox is not to blame for this, but rather either Ubuntu or Java's runtime environment. And because copy/paste from Firefox to Writer works in Ubuntu, I think Java's runtime environment does not handle the Firefox data flavors in the Ubuntu clipboard correctly.
New insights:
I have compared the flavormap.properties files of my Windows 10 and my Ubuntu installations, and there is a difference. In Ubuntu the native name for text/html is UTF8_STRING, while in Windows it is HTML Format. So I thought that this might be the problem and added the line
HTML\\ Format=text/html;charset=utf-8;eoln="\\n";terminators=0
to my flavormap.properties file in Ubuntu.
After that:
Map<DataFlavor,String> nativesForFlavors = SystemFlavorMap.getDefaultFlavorMap().getNativesForFlavors(
 new DataFlavor[]{
  new DataFlavor("text/html;charset=UTF-16LE")
 });
System.out.println(nativesForFlavors);
prints
{java.awt.datatransfer.DataFlavor[mimetype=text/html;representationclass=java.io.InputStream;charset=UTF-16LE]=HTML Format}
But no changes in the results of the Ubuntu clipboard content when read by Java.
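For completeness: instead of editing flavormap.properties, the same mapping can also be added at runtime through the SystemFlavorMap API. A sketch (the class name is only for illustration); as with the properties file, this changes the mapping but presumably not the decoding, since the mangling happens in the native-to-Java conversion:

```java
import java.awt.datatransfer.DataFlavor;
import java.awt.datatransfer.SystemFlavorMap;

public class AddHtmlFormatMapping {
 public static void main(String[] args) throws Exception {
  SystemFlavorMap flavorMap = (SystemFlavorMap) SystemFlavorMap.getDefaultFlavorMap();
  DataFlavor htmlFlavor = new DataFlavor("text/html;charset=utf-8");
  // register the mapping in both directions: flavor -> native and native -> flavor
  flavorMap.addUnencodedNativeForFlavor(htmlFlavor, "HTML Format");
  flavorMap.addFlavorForUnencodedNative("HTML Format", htmlFlavor);
  System.out.println(flavorMap.getNativesForFlavor(htmlFlavor));
 }
}
```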
After looking at this quite a bit, it appears to be a longstanding bug in Java (there is an even older report here).
It looks like the X11 Java components expect clipboard data to always be UTF-8 encoded, while Firefox encodes the data as UTF-16. Because of this assumption, Java mangles the text by forcibly parsing the UTF-16 as UTF-8. I tried, but couldn't find a good way to bypass the issue. The "text" part of "text/html" seems to indicate to Java that the bytes received from the clipboard should always be interpreted as text first and then offered in the various flavors. I couldn't find any straightforward way to access the unconverted byte array from X11.
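One way to sidestep Java's conversion entirely, at the cost of an external dependency, is to shell out to the xclip tool and read the raw target bytes from the X11 selection. This is only a sketch and assumes xclip is installed; the class and method names are illustrative:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public class RawClipboardViaXclip {

 // run a command and capture its stdout as raw bytes, without any charset conversion
 static byte[] runAndCapture(String... command) throws Exception {
  Process process = new ProcessBuilder(command).start();
  try (InputStream in = process.getInputStream();
       ByteArrayOutputStream out = new ByteArrayOutputStream()) {
   byte[] buffer = new byte[4096];
   int length;
   while ((length = in.read(buffer)) != -1) {
    out.write(buffer, 0, length);
   }
   process.waitFor();
   return out.toByteArray();
  }
 }

 public static void main(String[] args) throws Exception {
  try {
   // ask X11 for the text/html target exactly as the clipboard owner provides it,
   // so Firefox's UTF-16LE bytes (including the BOM) arrive unconverted
   byte[] rawHtml = runAndCapture("xclip", "-selection", "clipboard", "-t", "text/html", "-o");
   System.out.println(rawHtml.length + " bytes of raw text/html");
  } catch (java.io.IOException e) {
   System.out.println("xclip not available: " + e.getMessage());
  }
 }
}
```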
Since there is no satisfying answer so far, it seems we need an ugly workaround to work with the system clipboard of Ubuntu using Java. A pity. O tempora, o mores. We live in times where Windows handles Unicode encoding better than Ubuntu Linux does.
What we know is already stated in the answer above: we have a properly encoded text/plain result but a messed-up text/html result, and we know how the text/html result is messed up.
So what we can do is "repair" the wrongly encoded HTML by first replacing all mangled byte sequences with the correct replacement characters. Then we can replace the replacement characters with the correct characters taken from the properly encoded plain text. Of course this can only be done for the parts of the HTML which are visible text, not within attributes, because the attribute contents do not appear in the plain text.
Workaround:
import java.io.*;
import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.DataFlavor;
import java.nio.charset.Charset;

public class WorkingWithClipboadDataBytesUTF8 {

 static byte[] repairUTF8HTMLDataBytes(byte[] plainDataBytes, byte[] htmlDataBytes) throws Exception {

  //get all the non-ASCII characters from plainDataBytes
  //we need them for replacement later
  String plain = new String(plainDataBytes, Charset.forName("UTF-8"));
  char[] chars = plain.toCharArray();
  StringBuffer unicodeChars = new StringBuffer();
  for (int i = 0; i < chars.length; i++) {
   if (chars[i] > 127) unicodeChars.append(chars[i]);
  }
  System.out.println(unicodeChars);

  //omit the first 6 bytes from htmlDataBytes which are the wrong BOM
  htmlDataBytes = java.util.Arrays.copyOfRange(htmlDataBytes, 6, htmlDataBytes.length);

  //The wrongly UTF-8 encoded single bytes which were not replaced by 0xefbfbd
  //are coincidentally UTF-16LE when taken as pairs of two immediately following bytes.
  //So we are "repairing" this accordingly.
  //Goal: all garbage shall become the replacement character 0xFFFD.
  //replace parts of surrogate pairs with 0xFFFD
  //replace the wrong UTF-8 bytes 0xefbfbd for the replacement character with 0xFFFD
  ByteArrayInputStream in = new ByteArrayInputStream(htmlDataBytes);
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  int b = -1;
  int[] btmp = new int[6];
  while ((b = in.read()) != -1) {
   btmp[0] = b;
   btmp[1] = in.read(); //there must always be two bytes because of the wrongly encoded 16-bit Unicode
   if (btmp[0] != 0xef && btmp[1] != 0xef) { // not a replacement character
    if (btmp[1] > 0xd7 && btmp[1] < 0xe0) { // part of a surrogate pair
     out.write(0xFD); out.write(0xFF);
    } else {
     out.write(btmp[0]); out.write(btmp[1]); //two default bytes
    }
   } else { // at least one must be the replacement 0xefbfbd
    btmp[2] = in.read(); btmp[3] = in.read(); //there must be at least two further bytes
    if (btmp[0] != 0xef && btmp[1] == 0xef && btmp[2] == 0xbf && btmp[3] == 0xbd ||
        btmp[0] == 0xef && btmp[1] == 0xbf && btmp[2] == 0xbd && btmp[3] != 0xef) {
     out.write(0xFD); out.write(0xFF);
    } else if (btmp[0] == 0xef && btmp[1] == 0xbf && btmp[2] == 0xbd && btmp[3] == 0xef) {
     btmp[4] = in.read(); btmp[5] = in.read();
     if (btmp[4] == 0xbf && btmp[5] == 0xbd) {
      out.write(0xFD); out.write(0xFF);
     } else {
      throw new Exception("Wrong byte sequence: "
        + String.format("%02X%02X%02X%02X%02X%02X", btmp[0], btmp[1], btmp[2], btmp[3], btmp[4], btmp[5]),
        new Throwable().fillInStackTrace());
     }
    } else {
     throw new Exception("Wrong byte sequence: "
       + String.format("%02X%02X%02X%02X%02X%02X", btmp[0], btmp[1], btmp[2], btmp[3], btmp[4], btmp[5]),
       new Throwable().fillInStackTrace());
    }
   }
  }
  htmlDataBytes = out.toByteArray();

  //now take this as UTF-16LE (2 bytes for each character, little-endian)
  String html = new String(htmlDataBytes, Charset.forName("UTF-16LE"));
  System.out.println(html);

  //replace all of the wrong Unicode with the unicodeChars collected from plainDataBytes
  boolean insideTag = false;
  int unicodeCharCount = 0;
  char[] textChars = html.toCharArray();
  StringBuffer newHTML = new StringBuffer();
  for (int i = 0; i < textChars.length; i++) {
   if (textChars[i] == '<') insideTag = true;
   if (textChars[i] == '>') insideTag = false;
   if (!insideTag && textChars[i] > 127) {
    if (unicodeCharCount >= unicodeChars.length())
     throw new Exception("Unicode char counts don't match. "
       + "We got " + unicodeChars.length() + " chars from the plain text. Text until now:\n" + newHTML,
       new Throwable().fillInStackTrace());
    newHTML.append(unicodeChars.charAt(unicodeCharCount++));
   } else {
    newHTML.append(textChars[i]);
   }
  }
  html = newHTML.toString();
  System.out.println(html);

  return html.getBytes("UTF-8");
 }
 static void doSomethingWithUTF8BytesFromClipboard(byte[] plainDataBytes, byte[] htmlDataBytes) throws Exception {

  if (plainDataBytes != null && htmlDataBytes != null) {

   String fileName;
   OutputStream fileOut;

   fileName = "ResultPlainText.txt";
   fileOut = new FileOutputStream(fileName);
   fileOut.write(plainDataBytes, 0, plainDataBytes.length);
   fileOut.close();

   fileName = "ResultHTMLRaw.txt";
   fileOut = new FileOutputStream(fileName);
   fileOut.write(htmlDataBytes, 0, htmlDataBytes.length);
   fileOut.close();

   //do we have wrongly encoded UTF-8 in htmlDataBytes?
   if (htmlDataBytes[0] == (byte)0xef && htmlDataBytes[1] == (byte)0xbf && htmlDataBytes[2] == (byte)0xbd
       && htmlDataBytes[3] == (byte)0xef && htmlDataBytes[4] == (byte)0xbf && htmlDataBytes[5] == (byte)0xbd) {
    //try repairing the UTF-8 HTML data bytes
    htmlDataBytes = repairUTF8HTMLDataBytes(plainDataBytes, htmlDataBytes);
   //or do we have an additional 0x00 byte at the end?
   } else if (htmlDataBytes[htmlDataBytes.length-1] == (byte)0x00) {
    //repair this
    htmlDataBytes = java.util.Arrays.copyOf(htmlDataBytes, htmlDataBytes.length-1);
   }

   fileName = "ResultHTML.txt";
   fileOut = new FileOutputStream(fileName);
   fileOut.write(htmlDataBytes, 0, htmlDataBytes.length);
   fileOut.close();
  }
 }

 public static void main(String[] args) throws Exception {

  Clipboard clipboard = Toolkit.getDefaultToolkit().getSystemClipboard();

  byte[] htmlDataBytes = null;
  byte[] plainDataBytes = null;

  for (DataFlavor dataFlavor : clipboard.getAvailableDataFlavors()) {
   String mimeType = dataFlavor.getHumanPresentableName();
   if ("text/html".equalsIgnoreCase(mimeType)) {
    String paramClass = dataFlavor.getParameter("class");
    if ("[B".equals(paramClass)) {
     String paramCharset = dataFlavor.getParameter("charset");
     if (paramCharset != null && "UTF-8".equalsIgnoreCase(paramCharset)) {
      htmlDataBytes = (byte[])clipboard.getData(dataFlavor);
     }
    } //else if ("java.io.InputStream".equals(paramClass)) ...
   } else if ("text/plain".equalsIgnoreCase(mimeType)) {
    String paramClass = dataFlavor.getParameter("class");
    if ("[B".equals(paramClass)) {
     String paramCharset = dataFlavor.getParameter("charset");
     if (paramCharset != null && "UTF-8".equalsIgnoreCase(paramCharset)) {
      plainDataBytes = (byte[])clipboard.getData(dataFlavor);
     }
    } //else if ("java.io.InputStream".equals(paramClass)) ...
   }
  }

  doSomethingWithUTF8BytesFromClipboard(plainDataBytes, htmlDataBytes);
 }
}