
Clipboard content is messed up when copied from Firefox and read using Java in Ubuntu

Background

I am trying to get clipboard data in the HTML data flavor using Java. I copy content into the clipboard from browsers and then use java.awt.datatransfer.Clipboard to read it.

This works properly on Windows. But on Ubuntu there are some strange issues, the worst of which occurs when the data is copied into the clipboard from Firefox.

Example for reproducing the behavior

Java code:

import java.io.*;

import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.DataFlavor;

public class WorkingWithClipboadData {

 static void doSomethingWithBytesFromClipboard(byte[] dataBytes, String paramCharset, int number) throws Exception {

  String fileName = "Result " + number + " " + paramCharset + ".txt";

  OutputStream fileOut = new FileOutputStream(fileName);
  fileOut.write(dataBytes, 0, dataBytes.length);
  fileOut.close();

 }

 public static void main(String[] args) throws Exception {

  Clipboard clipboard = Toolkit.getDefaultToolkit().getSystemClipboard();

  int count = 0;

  for (DataFlavor dataFlavor : clipboard.getAvailableDataFlavors()) {

System.out.println(dataFlavor);

   String mimeType = dataFlavor.getHumanPresentableName();
   if ("text/html".equalsIgnoreCase(mimeType)) {
    String paramClass = dataFlavor.getParameter("class");
    if ("java.io.InputStream".equals(paramClass)) {
     String paramCharset = dataFlavor.getParameter("charset");
     if (paramCharset != null  && paramCharset.startsWith("UTF")) {

System.out.println("============================================");
System.out.println(paramCharset);
System.out.println("============================================");

      InputStream inputStream = (InputStream)clipboard.getData(dataFlavor);

      ByteArrayOutputStream data = new ByteArrayOutputStream();

      byte[] buffer = new byte[1024];
      int length = -1;
      while ((length = inputStream.read(buffer)) != -1) {
       data.write(buffer, 0, length);
      }
      data.flush();
      inputStream.close();

      byte[] dataBytes = data.toByteArray();
      data.close();

      doSomethingWithBytesFromClipboard(dataBytes, paramCharset, ++count);

     }
    }
   }
  }
 }

}

Problem description

What I am doing is opening the URL https://en.wikipedia.org/wiki/Germanic_umlaut in Firefox. There I select "letters: ä" and copy it into the clipboard. Then I run my Java program. The resulting files (only some of them, as examples) look like this:

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 1 UTF-16.txt" 
00000000: feff fffd fffd 006c 0000 0065 0000 0074  .......l...e...t
00000010: 0000 0074 0000 0065 0000 0072 0000 0073  ...t...e...r...s
00000020: 0000 003a 0000 0020 0000 003c 0000 0069  ...:... ...<...i
00000030: 0000 003e 0000 fffd 0000 003c 0000 002f  ...>.......<.../
00000040: 0000 0069 0000 003e 0000                 ...i...>..

OK, the FEFF at the start looks like a UTF-16BE byte order mark. But what is the FFFD? And why are there those 0000 bytes between the single letters? The UTF-16 encoding of l is 006C only. It seems as if all letters are encoded in 32 bits, which is wrong for UTF-16. And all non-ASCII characters are encoded as FFFD 0000 and so are lost.

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 4 UTF-8.txt" 
00000000: efbf bdef bfbd 6c00 6500 7400 7400 6500  ......l.e.t.t.e.
00000010: 7200 7300 3a00 2000 3c00 6900 3e00 efbf  r.s.:. .<.i.>...
00000020: bd00 3c00 2f00 6900 3e00                 ..<./.i.>.

Here the EFBF BDEF BFBD does not look like any known byte order mark. And all letters seem to be encoded in 16 bits, double what UTF-8 needs. So the bit count used always seems to be double the required count, as in the UTF-16 example above. And all non-ASCII letters are encoded as EFBFBD and so are also lost.

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 7 UTF-16BE.txt" 
00000000: fffd fffd 006c 0000 0065 0000 0074 0000  .....l...e...t..
00000010: 0074 0000 0065 0000 0072 0000 0073 0000  .t...e...r...s..
00000020: 003a 0000 0020 0000 003c 0000 0069 0000  .:... ...<...i..
00000030: 003e 0000 fffd 0000 003c 0000 002f 0000  .>.......<.../..
00000040: 0069 0000 003e 0000                      .i...>..

Same picture as in the examples above. All letters are encoded using 32 bits, while UTF-16 should use only 16 bits, except for supplementary characters, which use surrogate pairs. And all non-ASCII letters are encoded as FFFD 0000 and so are lost.

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 10 UTF-16LE.txt" 
00000000: fdff fdff 6c00 0000 6500 0000 7400 0000  ....l...e...t...
00000010: 7400 0000 6500 0000 7200 0000 7300 0000  t...e...r...s...
00000020: 3a00 0000 2000 0000 3c00 0000 6900 0000  :... ...<...i...
00000030: 3e00 0000 fdff 0000 3c00 0000 2f00 0000  >.......<.../...
00000040: 6900 0000 3e00 0000                      i...>...

Included only for completeness. Same picture as above.

So the conclusion is that the Ubuntu clipboard content is totally messed up after copying something into it from Firefox, at least for HTML data flavors read using Java.

Other browser used

When I do the same thing using the Chromium browser as the source of the data, the problems become smaller.

So I open the URL https://en.wikipedia.org/wiki/Germanic_umlaut in Chromium, select "letters: ä" there, copy it into the clipboard, and run my Java program.

The result looks like:

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 1 UTF-16.txt" 
00000000: feff 003c 006d 0065 0074 0061 0020 0068  ...<.m.e.t.a. .h
...
00000800: 0061 006c 003b 0022 003e 00e4 003c 002f  .a.l.;.".>...<./
00000810: 0069 003e 0000                           .i.>..

Chromium puts more HTML around the selection in the HTML data flavors in the clipboard. But the encoding looks proper, also for the non-ASCII ä = 00E4. There is one small problem, though: there are additional 0000 bytes at the end which should not be there. In UTF-16 there are two additional 00 bytes at the end.

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 4 UTF-8.txt" 
00000000: 3c6d 6574 6120 6874 7470 2d65 7175 6976  <meta http-equiv
...
000003f0: 696f 6e2d 636f 6c6f 723a 2069 6e69 7469  ion-color: initi
00000400: 616c 3b22 3ec3 a43c 2f69 3e00            al;">..</i>.

Same as above. The encoding looks proper for UTF-8, but here too there is one additional 00 byte at the end which should not be there.
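Trimming that stray trailing NUL is straightforward. A minimal sketch (the helper name stripTrailingNul is just for illustration and mirrors the repair done in the workaround further below):

 //strip a single trailing 0x00 byte, as seen with HTML data copied from Chromium
 static byte[] stripTrailingNul(byte[] dataBytes) {
  if (dataBytes.length > 0 && dataBytes[dataBytes.length - 1] == (byte)0x00) {
   return java.util.Arrays.copyOf(dataBytes, dataBytes.length - 1);
  }
  return dataBytes;
 }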

Environment

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"


Mozilla Firefox 61.0.1 (64-Bit)


java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)

Questions

Am I doing something wrong in my code?

Can someone advise how to avoid this messed-up content in the clipboard? Since the non-ASCII characters are lost, at least when copied from Firefox, I don't think we can repair this content.

Is this a known issue? Can someone confirm the same behavior? If so, is there already a bug report for Firefox about this?

Or is this a problem which only occurs when Java code reads the clipboard content? It seems so, because if I copy content from Firefox and paste it into LibreOffice Writer, the Unicode appears properly. And if I then copy content from Writer to the clipboard and read it using my Java program, the UTF encodings are correct except for the additional 00 bytes at the end. So clipboard content copied from Writer behaves like content copied from Chromium.


New insights

The bytes 0xFFFD are the Unicode REPLACEMENT CHARACTER (U+FFFD). So 0xFDFF is its little-endian representation, and 0xEFBFBD is its UTF-8 encoding. All results thus seem to be the outcome of wrongly decoding and then re-encoding Unicode.
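These byte patterns can be verified directly in Java; a quick sketch:

import java.nio.charset.StandardCharsets;

public class ReplacementCharBytes {

 static void print(String label, byte[] bytes) {
  StringBuilder hex = new StringBuilder(label + ": ");
  for (byte b : bytes) hex.append(String.format("%02x ", b));
  System.out.println(hex);
 }

 public static void main(String[] args) {
  String rc = "\uFFFD"; //REPLACEMENT CHARACTER U+FFFD
  print("UTF-16BE", rc.getBytes(StandardCharsets.UTF_16BE)); //ff fd
  print("UTF-16LE", rc.getBytes(StandardCharsets.UTF_16LE)); //fd ff
  print("UTF-8   ", rc.getBytes(StandardCharsets.UTF_8));    //ef bf bd
 }
}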

It seems the clipboard content coming from Firefox is always UTF-16LE with a BOM. But Java then reads it as UTF-8. So the 2-byte BOM becomes two messed-up characters, each replaced with 0xEFBFBD; each additional 0x00 byte becomes its own NUL character; and every byte sequence which is not a proper UTF-8 sequence becomes a messed-up character, replaced with 0xEFBFBD. Then this pseudo-UTF-8 is re-encoded. Now the garbage is complete.

Example:

The sequence aɛaüa in UTF-16LE with BOM is 0xFFFE 6100 5B02 6100 FC00 6100.

Taken as UTF-8 (where each byte that is not a proper UTF-8 sequence becomes 0xEFBFBD), this becomes 0xEFBFBD 0xEFBFBD a NUL [ STX a NUL 0xEFBFBD NUL a NUL.

This mis-decoded text re-encoded to UTF-16LE is: 0xFDFF FDFF 6100 0000 5B00 0200 6100 0000 FDFF 0000 6100 0000.

Re-encoded to UTF-8, it is: 0xEFBF BDEF BFBD 6100 5B02 6100 EFBF BD00 6100.

And this is exactly what happens.
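The whole chain can be reproduced in a few lines of Java (a sketch; the class name MojibakeDemo is just for illustration):

import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

 static String hex(byte[] bytes) {
  StringBuilder sb = new StringBuilder();
  for (byte b : bytes) sb.append(String.format("%02x", b));
  return sb.toString();
 }

 public static void main(String[] args) {
  //aɛaüa with a leading BOM, encoded as Firefox seems to put it on the clipboard
  byte[] fromFirefox = ("\uFEFF" + "a\u025Ba\u00FCa").getBytes(StandardCharsets.UTF_16LE);

  //Java wrongly decodes these bytes as UTF-8, replacing invalid bytes with U+FFFD
  String mangled = new String(fromFirefox, StandardCharsets.UTF_8);

  //re-encoding then yields exactly the byte patterns from the hex dumps above
  System.out.println(hex(mangled.getBytes(StandardCharsets.UTF_16LE))); //fdfffdff610000005b000200...
  System.out.println(hex(mangled.getBytes(StandardCharsets.UTF_8)));    //efbfbdefbfbd61005b0261...
 }
}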

Other examples:

U+00C2 (Â) = C200 in UTF-16LE = 0xEFBFBD00 in pseudo-UTF-8 (0xC2 without a valid continuation byte gets replaced).

U+80C2 = C280 in UTF-16LE = 0xC280 in pseudo-UTF-8 (coincidentally a valid UTF-8 sequence, so it passes through unchanged).

So I think Firefox is not to blame for this, but either Ubuntu or Java's runtime environment. And because copy/paste from Firefox to Writer works in Ubuntu, I think Java's runtime environment does not handle the Firefox data flavors in the Ubuntu clipboard correctly.


New insights

I have compared the flavormap.properties files of my Windows 10 and my Ubuntu systems, and there is a difference. In Ubuntu the native name for text/html is UTF8_STRING, while in Windows it is HTML Format. So I thought this might be the problem and added the line

HTML\\ Format=text/html;charset=utf-8;eoln="\\n";terminators=0

to my flavormap.properties file in Ubuntu.

After that:

Map<DataFlavor,String> nativesForFlavors = SystemFlavorMap.getDefaultFlavorMap().getNativesForFlavors(
   new DataFlavor[]{
   new DataFlavor("text/html;charset=UTF-16LE")
   });

System.out.println(nativesForFlavors);

prints

{java.awt.datatransfer.DataFlavor[mimetype=text/html;representationclass=java.io.InputStream;charset=UTF-16LE]=HTML Format}

But there are no changes in the results when the Ubuntu clipboard content is read by Java.

After looking at this quite a bit, it looks like this is a longstanding bug in Java (even older report here).

It looks like the X11 Java components expect clipboard data to always be UTF-8 encoded, while Firefox encodes it as UTF-16. Because of the assumptions Java makes, it mangles the text by forcibly parsing UTF-16 as UTF-8. I tried but couldn't find a good way to bypass the issue. The "text" part of "text/html" seems to tell Java that the bytes received from the clipboard should always be interpreted as text first and then offered in the various flavors. I couldn't find any straightforward way to access the unconverted byte array from X11.
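One conceivable escape hatch is to bypass AWT entirely and read the raw text/html target with an external tool such as xclip. This is only a sketch, assuming xclip is installed; a real implementation would have to sniff the BOM, since Firefox appears to deliver UTF-16LE here while Chromium appears to deliver UTF-8:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class XClipRawHtml {

 public static void main(String[] args) throws Exception {

  //ask xclip for the raw bytes of the text/html target of the CLIPBOARD selection
  Process process = new ProcessBuilder(
   "xclip", "-selection", "clipboard", "-t", "text/html", "-o").start();

  InputStream in = process.getInputStream();
  ByteArrayOutputStream data = new ByteArrayOutputStream();
  byte[] buffer = new byte[1024];
  int length;
  while ((length = in.read(buffer)) != -1) {
   data.write(buffer, 0, length);
  }
  process.waitFor();

  //Firefox delivers UTF-16 with a BOM here;
  //Java's UTF-16 charset honors the BOM when decoding
  String html = new String(data.toByteArray(), StandardCharsets.UTF_16);
  System.out.println(html);
 }
}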

Since there is no satisfying answer until now, it seems we need an ugly workaround to work with the Ubuntu system clipboard from Java. A great pity. O tempora, o mores. We live in times where Windows handles Unicode encoding better than Ubuntu Linux does.

What we know is already stated in the answer above: we have a properly encoded text/plain result but a messed-up text/html result, and we know how the text/html result is messed up.

So what we can do is "repair" the wrongly encoded HTML by first replacing all messed-up byte sequences with the proper replacement character. Then we can replace the replacement characters with the correct characters taken from the properly encoded plain text. Of course this can only be done for the part of the HTML which is visible text, not within attributes, because attribute content does not appear in the plain text.

Workaround:

import java.io.*;

import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.DataFlavor;

import java.nio.charset.Charset;

public class WorkingWithClipboadDataBytesUTF8 {

 static byte[] repairUTF8HTMLDataBytes(byte[] plainDataBytes, byte[] htmlDataBytes) throws Exception {

  //get all the not ASCII characters from plainDataBytes
  //we need them for replacement later
  String plain = new String(plainDataBytes, Charset.forName("UTF-8"));
  char[] chars = plain.toCharArray();
  StringBuffer unicodeChars = new StringBuffer();
  for (int i = 0; i < chars.length; i++) {
   if (chars[i] > 127) unicodeChars.append(chars[i]);
  }
System.out.println(unicodeChars);

  //omit the first 6 bytes of htmlDataBytes, which are the mangled BOM (0xEFBFBD twice)
  htmlDataBytes = java.util.Arrays.copyOfRange(htmlDataBytes, 6, htmlDataBytes.length);

  //The wrongly UTF-8-decoded single bytes which were not replaced by 0xefbfbd
  //coincidentally form UTF-16LE when taken two bytes at a time.
  //So we "repair" this accordingly.
  //Goal: all garbage shall be the replacement character 0xFFFD.

  //replace parts of a surrogate pair with 0xFFFD
  //replace the wrong UFT-8 bytes 0xefbfbd for replacement character with 0xFFFD
  ByteArrayInputStream in = new ByteArrayInputStream(htmlDataBytes);
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  int b = -1;
  int[] btmp = new int[6];
  while ((b = in.read()) != -1) {
   btmp[0] = b;
   btmp[1] = in.read(); //there must always be two bytes because of the wrongly encoded 16-bit Unicode
   if (btmp[0] != 0xef && btmp[1] != 0xef) { // not a replacement character
    if (btmp[1] > 0xd7 && btmp[1] < 0xe0) { // part of a surrogate pair
     out.write(0xFD); out.write(0xFF);
    } else {
     out.write(btmp[0]); out.write(btmp[1]); //two default bytes
    }
   } else { // at least one must be the replacement 0xefbfbd
    btmp[2] = in.read(); btmp[3] = in.read(); //there must be at least two further bytes
    if (btmp[0] != 0xef && btmp[1] == 0xef && btmp[2] == 0xbf && btmp[3] == 0xbd ||
        btmp[0] == 0xef && btmp[1] == 0xbf && btmp[2] == 0xbd && btmp[3] != 0xef) {
     out.write(0xFD); out.write(0xFF);
    } else if (btmp[0] == 0xef && btmp[1] == 0xbf && btmp[2] == 0xbd && btmp[3] == 0xef) {
     btmp[4] = in.read(); btmp[5] = in.read();
     if (btmp[4] == 0xbf &&  btmp[5] == 0xbd) {
      out.write(0xFD); out.write(0xFF);
     } else {
      throw new Exception("Wrong byte sequence: "
      + String.format("%02X%02X%02X%02X%02X%02X", btmp[0], btmp[1], btmp[2], btmp[3], btmp[4], btmp[5]), 
      new Throwable().fillInStackTrace());
     }
    } else {
     throw new Exception("Wrong byte sequence: " 
      + String.format("%02X%02X%02X%02X%02X%02X", btmp[0], btmp[1], btmp[2], btmp[3], btmp[4], btmp[5]),
      new Throwable().fillInStackTrace());
    }
   }
  }

  htmlDataBytes = out.toByteArray();

  //now get this as UTF_16LE (2 byte for each character, little endian)
  String html = new String(htmlDataBytes, Charset.forName("UTF-16LE"));
System.out.println(html);

  //replace all of the wrongUnicode with the unicodeChars selected from plainDataBytes
  boolean insideTag = false;
  int unicodeCharCount = 0;
  char[] textChars = html.toCharArray();
  StringBuffer newHTML = new StringBuffer();
  for (int i = 0; i < textChars.length; i++) {
   if (textChars[i] == '<') insideTag = true;
   if (textChars[i] == '>') insideTag = false;
   if (!insideTag && textChars[i] > 127) {
    if (unicodeCharCount >= unicodeChars.length()) 
     throw new Exception("Unicode chars count don't match. " 
      + "We got from plain text " + unicodeChars.length() + " chars. Text until now:\n" + newHTML,
      new Throwable().fillInStackTrace());

    newHTML.append(unicodeChars.charAt(unicodeCharCount++));
   } else {
    newHTML.append(textChars[i]);
   }
  }

  html = newHTML.toString();
System.out.println(html);

  return html.getBytes("UTF-8");

 }

 static void doSomethingWithUTF8BytesFromClipboard(byte[] plainDataBytes, byte[] htmlDataBytes) throws Exception {

  if (plainDataBytes != null && htmlDataBytes != null) {

   String fileName; 
   OutputStream fileOut;

   fileName = "ResultPlainText.txt";
   fileOut = new FileOutputStream(fileName);
   fileOut.write(plainDataBytes, 0, plainDataBytes.length);
   fileOut.close();

   fileName = "ResultHTMLRaw.txt";
   fileOut = new FileOutputStream(fileName);
   fileOut.write(htmlDataBytes, 0, htmlDataBytes.length);
   fileOut.close();

   //do we have wrongly encoded UTF-8 in htmlDataBytes?
   if (htmlDataBytes[0] == (byte)0xef && htmlDataBytes[1] == (byte)0xbf && htmlDataBytes[2] == (byte)0xbd 
    && htmlDataBytes[3] == (byte)0xef && htmlDataBytes[4] == (byte)0xbf && htmlDataBytes[5] == (byte)0xbd) {
    //try to repair the UTF-8 HTML data bytes
    htmlDataBytes = repairUTF8HTMLDataBytes(plainDataBytes, htmlDataBytes);
   //else: do we have an additional 0x00 byte at the end?
   } else if (htmlDataBytes[htmlDataBytes.length-1] == (byte)0x00) {
    //then repair this
    htmlDataBytes = java.util.Arrays.copyOf(htmlDataBytes, htmlDataBytes.length-1);
   }

   fileName = "ResultHTML.txt";
   fileOut = new FileOutputStream(fileName);
   fileOut.write(htmlDataBytes, 0, htmlDataBytes.length);
   fileOut.close();

  }

 }

 public static void main(String[] args) throws Exception {

  Clipboard clipboard = Toolkit.getDefaultToolkit().getSystemClipboard();

  byte[] htmlDataBytes = null;
  byte[] plainDataBytes = null;

  for (DataFlavor dataFlavor : clipboard.getAvailableDataFlavors()) {

   String mimeType = dataFlavor.getHumanPresentableName();

   if ("text/html".equalsIgnoreCase(mimeType)) {
    String paramClass = dataFlavor.getParameter("class");
    if ("[B".equals(paramClass)) {
     String paramCharset = dataFlavor.getParameter("charset");
     if (paramCharset != null  && "UTF-8".equalsIgnoreCase(paramCharset)) {

      htmlDataBytes = (byte[])clipboard.getData(dataFlavor);

     }
    } //else if("java.io.InputStream".equals(paramClass)) ...

   } else if ("text/plain".equalsIgnoreCase(mimeType)) {
    String paramClass = dataFlavor.getParameter("class");
    if ("[B".equals(paramClass)) {
     String paramCharset = dataFlavor.getParameter("charset");
     if (paramCharset != null  && "UTF-8".equalsIgnoreCase(paramCharset)) {

      plainDataBytes = (byte[])clipboard.getData(dataFlavor);

     }
    } //else if("java.io.InputStream".equals(paramClass)) ...
   }
  }

  doSomethingWithUTF8BytesFromClipboard(plainDataBytes, htmlDataBytes);

 }

}
