Clipboard content is garbled when copied from Firefox and read using Java on Ubuntu

Question

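Background

I am trying to read clipboard data in the HTML data flavor using Java. To do so I copy content from a browser to the clipboard and then read it with java.awt.datatransfer.Clipboard.

This works properly on Windows. On Ubuntu, however, there are some strange issues. The worst of them occur when the data is copied to the clipboard from the Firefox browser.

Example to reproduce the behavior

Java code: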

import java.io.*;

import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.DataFlavor;

public class WorkingWithClipboadData {

 static void doSomethingWithBytesFromClipboard(byte[] dataBytes, String paramCharset, int number) throws Exception {

  String fileName = "Result " + number + " " + paramCharset + ".txt";

  OutputStream fileOut = new FileOutputStream(fileName);
  fileOut.write(dataBytes, 0, dataBytes.length);
  fileOut.close();

 }

 public static void main(String[] args) throws Exception {

  Clipboard clipboard = Toolkit.getDefaultToolkit().getSystemClipboard();

  int count = 0;

  for (DataFlavor dataFlavor : clipboard.getAvailableDataFlavors()) {

   System.out.println(dataFlavor);

   String mimeType = dataFlavor.getHumanPresentableName();
   if ("text/html".equalsIgnoreCase(mimeType)) {
    String paramClass = dataFlavor.getParameter("class");
    if ("java.io.InputStream".equals(paramClass)) {
     String paramCharset = dataFlavor.getParameter("charset");
     if (paramCharset != null  && paramCharset.startsWith("UTF")) {

      System.out.println("============================================");
      System.out.println(paramCharset);
      System.out.println("============================================");

      InputStream inputStream = (InputStream)clipboard.getData(dataFlavor);

      ByteArrayOutputStream data = new ByteArrayOutputStream();

      byte[] buffer = new byte[1024];
      int length = -1;
      while ((length = inputStream.read(buffer)) != -1) {
       data.write(buffer, 0, length);
      }
      data.flush();
      inputStream.close();

      byte[] dataBytes = data.toByteArray();
      data.close();

      doSomethingWithBytesFromClipboard(dataBytes, paramCharset, ++count);

     }
    }
   }
  }
 }

}

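Problem description

What I do is open the URL https://en.wikipedia.org/wiki/Germanic_umlaut in Firefox. Then I select "letters: ä" there and copy it to the clipboard. Then I run my Java program. Afterwards the resulting files (only some of them are shown as examples) look like this: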

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 1 UTF-16.txt" 
00000000: feff fffd fffd 006c 0000 0065 0000 0074  .......l...e...t
00000010: 0000 0074 0000 0065 0000 0072 0000 0073  ...t...e...r...s
00000020: 0000 003a 0000 0020 0000 003c 0000 0069  ...:... ...<...i
00000030: 0000 003e 0000 fffd 0000 003c 0000 002f  ...>.......<.../
00000040: 0000 0069 0000 003e 0000                 ...i...>..

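OK, the FEFF at the start looks like a UTF-16BE byte order mark. But what is FFFD? And why are there 0000 bytes between the single letters? The UTF-16 encoding of l is just 006C. It seems that all letters are encoded in 32 bits, which is wrong for UTF-16. And all non-ASCII characters are encoded as FFFD 0000 and are therefore lost.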

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 4 UTF-8.txt" 
00000000: efbf bdef bfbd 6c00 6500 7400 7400 6500  ......l.e.t.t.e.
00000010: 7200 7300 3a00 2000 3c00 6900 3e00 efbf  r.s.:. .<.i.>...
00000020: bd00 3c00 2f00 6900 3e00                 ..<./.i.>.

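Here EFBF BDEF BFBD does not look like any known byte order mark. All letters seem to be encoded in 16 bits, which is twice the bits needed for UTF-8. So the number of bits used always seems to be double what is needed; see the UTF-16 example above. And all non-ASCII letters are encoded as EFBFBD and are therefore lost as well.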

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 7 UTF-16BE.txt" 
00000000: fffd fffd 006c 0000 0065 0000 0074 0000  .....l...e...t..
00000010: 0074 0000 0065 0000 0072 0000 0073 0000  .t...e...r...s..
00000020: 003a 0000 0020 0000 003c 0000 0069 0000  .:... ...<...i..
00000030: 003e 0000 fffd 0000 003c 0000 002f 0000  .>.......<.../..
00000040: 0069 0000 003e 0000                      .i...>..

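Same picture as in the example above. All letters are encoded using 32 bits, although UTF-16 should use only 16 bits except for supplementary characters, which use surrogate pairs. And all non-ASCII letters are encoded as FFFD 0000 and are therefore lost.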

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 10 UTF-16LE.txt" 
00000000: fdff fdff 6c00 0000 6500 0000 7400 0000  ....l...e...t...
00000010: 7400 0000 6500 0000 7200 0000 7300 0000  t...e...r...s...
00000020: 3a00 0000 2000 0000 3c00 0000 6900 0000  :... ...<...i...
00000030: 3e00 0000 fdff 0000 3c00 0000 2f00 0000  >.......<.../...
00000040: 6900 0000 3e00 0000                      i...>...

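Just for completeness. Same picture as above.

So the conclusion is that the Ubuntu clipboard is totally messed up after content has been copied into it from Firefox. At least for the HTML data flavors, and when the clipboard is read using Java.

Using another browser

When I do the same using the Chromium browser as the data source, the problems get smaller.

So I open the URL https://en.wikipedia.org/wiki/Germanic_umlaut in Chromium. Then I select "letters: ä" there and copy it to the clipboard. Then I run my Java program.

The results look like this: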

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 1 UTF-16.txt" 
00000000: feff 003c 006d 0065 0074 0061 0020 0068  ...<.m.e.t.a. .h
...
00000800: 0061 006c 003b 0022 003e 00e4 003c 002f  .a.l.;.".>...<./
00000810: 0069 003e 0000                           .i.>..

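Chromium puts more HTML into the HTML data flavors on the clipboard. But the encoding looks correct, also for the non-ASCII ä = 00E4. There is a small issue here too, though: there are additional 0000 bytes at the end which should not be there. In UTF-16 that is 2 additional 00 bytes at the end.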

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 4 UTF-8.txt" 
00000000: 3c6d 6574 6120 6874 7470 2d65 7175 6976  <meta http-equiv
...
000003f0: 696f 6e2d 636f 6c6f 723a 2069 6e69 7469  ion-color: initi
00000400: 616c 3b22 3ec3 a43c 2f69 3e00            al;">..</i>.

Same as above. The encoding looks correct for UTF-8. But here too there is an additional 00 byte at the end which should not be there.
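The stray terminator seen in the Chromium dumps can be dropped before the bytes are processed further. The following is only a minimal sketch, assuming the terminator is exactly one NUL code unit (1 byte for UTF-8, 2 bytes for UTF-16). The class name TerminatorStripper, the method name stripTerminator and the unitSize parameter are made up for this illustration.

```java
import java.util.Arrays;

class TerminatorStripper {

    // Returns a copy of data without one trailing NUL code unit, if present.
    // unitSize is an assumption of this sketch: 1 for UTF-8, 2 for UTF-16.
    static byte[] stripTerminator(byte[] data, int unitSize) {
        if (unitSize < 1 || data.length < unitSize || data.length % unitSize != 0) {
            return data; // nothing to strip, or not aligned to whole code units
        }
        for (int i = data.length - unitSize; i < data.length; i++) {
            if (data[i] != 0) {
                return data; // the last code unit is not NUL
            }
        }
        return Arrays.copyOf(data, data.length - unitSize);
    }

    public static void main(String[] args) {
        // "</i>" in UTF-8 followed by one 00 byte, as in the Chromium dump
        byte[] utf8 = { 0x3C, 0x2F, 0x69, 0x3E, 0x00 };
        // ">" in UTF-16BE followed by a 0000 code unit
        byte[] utf16 = { 0x00, 0x3E, 0x00, 0x00 };
        System.out.println(stripTerminator(utf8, 1).length);
        System.out.println(stripTerminator(utf16, 2).length);
    }
}
```

Checking whole code units rather than single bytes avoids cutting the high byte off a UTF-16LE character such as 3E 00.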

Environment

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"


Mozilla Firefox 61.0.1 (64-Bit)


java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)

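Questions

Am I doing something wrong in my code?

Can someone advise how to avoid the garbled content in the clipboard? Since the non-ASCII characters are lost, at least when copied from Firefox, I do not think we can repair this content afterwards.

Is this a known issue? Can someone confirm the same behavior? If so, is there already a bug report about this for Firefox?

Or is this an issue that only occurs when Java code reads the clipboard content? It seems so. If I copy content from Firefox and paste it into LibreOffice Writer, the Unicode characters appear correctly. If I then copy the content from Writer to the clipboard and read it using my Java program, the UTF encodings are correct except for the additional 00 bytes at the end. So clipboard content copied from Writer behaves like content copied from the Chromium browser.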

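New insights

The bytes 0xFFFD seem to be the Unicode character 'REPLACEMENT CHARACTER' (U+FFFD). So 0xFDFF is its little-endian representation and 0xEFBFBD is its UTF-8 encoding. So all of the results seem to come from wrongly decoding and then re-encoding Unicode.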
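The three byte patterns can be checked directly in Java. This is a small sketch using only java.nio.charset.StandardCharsets; the class name ReplacementCharBytes and the hex helper are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharBytes {

    // Formats bytes as upper-case hex without separators.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String replacement = "\uFFFD"; // U+FFFD REPLACEMENT CHARACTER
        System.out.println("UTF-16BE: " + hex(replacement.getBytes(StandardCharsets.UTF_16BE))); // FFFD
        System.out.println("UTF-16LE: " + hex(replacement.getBytes(StandardCharsets.UTF_16LE))); // FDFF
        System.out.println("UTF-8:    " + hex(replacement.getBytes(StandardCharsets.UTF_8)));    // EFBFBD
    }
}
```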

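It seems as if the clipboard content coming from Firefox is UTF-16LE with a BOM, but Java takes it as UTF-8. So the 2-byte BOM becomes two garbled characters, each replaced with 0xEFBFBD; each additional 0x00 byte becomes a NUL character of its own; and every byte sequence that is not a valid UTF-8 byte sequence becomes a garbled character replaced with 0xEFBFBD. Then this pseudo-UTF-8 gets re-encoded. Now the garbage is complete.

Example:

The sequence aɛaüa in UTF-16LE with BOM is 0xFFFE 6100 5B02 6100 FC00 6100.

Taken as UTF-8 (0xEFBFBD = not a valid UTF-8 byte sequence) this is 0xEFBFBD 0xEFBFBD a NUL [ STX a NUL 0xEFBFBD NUL a NUL.

This pseudo-ASCII re-encoded as UTF-16LE is: 0xFDFF FDFF 6100 0000 5B00 0200 6100 0000 FDFF 0000 6100 0000

This pseudo-ASCII re-encoded as UTF-8 is: 0xEFBF BDEF BFBD 6100 5B02 6100 EFBF BD00 6100

And this is exactly what happens.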
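The example above can be reproduced without any clipboard involved. The sketch below assumes that the wrong decoding step behaves like new String(bytes, UTF_8), which substitutes U+FFFD for malformed input; whether the AWT clipboard code uses exactly this path is an assumption. The class name MisdecodeDemo and its helpers are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {

    // Formats bytes as upper-case hex without separators.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes the bytes as UTF-8 regardless of what they really are.
    // Malformed sequences are replaced with U+FFFD by this constructor.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "aɛaüa" as UTF-16LE with BOM: FFFE 6100 5B02 6100 FC00 6100
        byte[] raw = {
            (byte) 0xFF, (byte) 0xFE, 0x61, 0x00, 0x5B, 0x02,
            0x61, 0x00, (byte) 0xFC, 0x00, 0x61, 0x00
        };
        String garbled = decodeAsUtf8(raw);
        // Re-encoding the garbled text doubles the width of every former UTF-16 unit.
        System.out.println("as UTF-16LE: " + hex(garbled.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println("as UTF-8:    " + hex(garbled.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The two printed hex strings correspond to the UTF-16LE and UTF-8 byte sequences listed in the example.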

Other examples:

Â = 0x00C2 = C200 in UTF-16LE = 0xEFBFBD00 in pseudo-UTF-8

胂 = 0x80C2 = C280 in UTF-16LE = 0xC280 in pseudo-UTF-8
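These two cases differ in an important way: the first one ends up as a replacement character, while the second one happens to be a valid UTF-8 sequence and silently turns into a different character. A short sketch to check both, again assuming the wrong step behaves like new String(bytes, UTF_8); the class name SingleCharRoundTrip and its methods are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

class SingleCharRoundTrip {

    // Formats bytes as upper-case hex without separators.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes text as UTF-16LE, wrongly decodes it as UTF-8,
    // then re-encodes the result as UTF-8 and returns it as hex.
    static String garbledUtf8Hex(String text) {
        byte[] utf16le = text.getBytes(StandardCharsets.UTF_16LE);
        String wrong = new String(utf16le, StandardCharsets.UTF_8);
        return hex(wrong.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(garbledUtf8Hex("\u00C2")); // Â, C2 is a lead byte without continuation
        System.out.println(garbledUtf8Hex("\u80C2")); // 胂, C2 80 is valid UTF-8 for U+0080
    }
}
```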

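So I think Firefox is not to blame for this, but rather Ubuntu or Java's runtime environment. And because copy/paste from Firefox to Writer works on Ubuntu, I think Java's runtime environment does not correctly handle the Firefox data flavors in the Ubuntu clipboard.

New insights:

I have compared the flavormap.properties files of my Windows 10 and my Ubuntu, and there is a difference. On Ubuntu the native name for text/html is UTF8_STRING, while on Windows it is HTML Format. So I thought this might be the problem, and I added the line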

HTML\ Format=text/html;charset=utf-8;eoln="\n";terminators=0

to my flavormap.properties file on Ubuntu.

After that:

Map<DataFlavor,String> nativesForFlavors = SystemFlavorMap.getDefaultFlavorMap().getNativesForFlavors(
   new DataFlavor[]{
   new DataFlavor("text/html;charset=UTF-16LE")
   });

System.out.println(nativesForFlavors);

prints

{java.awt.datatransfer.DataFlavor[mimetype=text/html;representationclass=java.io.InputStream;charset=UTF-16LE]=HTML Format}

But the results for the Ubuntu clipboard content when read by Java do not change.

Answer 1

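After looking into this, it appears to be a long-standing bug in Java (an even older report of it exists as well).

It looks like the X11 Java components expect clipboard data to always be UTF-8 encoded, whereas Firefox encodes the data as UTF-16. Because of this assumption, Java garbles the text by forcing the UTF-16 to be parsed as UTF-8. I tried but could not find a good way around this. The "text" part of "text/html" seems to indicate to Java that the bytes received from the clipboard should always be interpreted as text first and only then be offered in the various flavors. I could not find any direct way to access the pre-conversion byte array from X11.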
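The difference between the two interpretations can be shown on a small byte array. This sketch does not touch the clipboard (the answer found no way to get the raw bytes from AWT); it only contrasts Java's BOM-aware UTF-16 charset with a forced UTF-8 decode of the same bytes. The class name DecodeContrast and its methods are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

class DecodeContrast {

    // Decodes with the BOM-aware UTF-16 charset, which reads a leading
    // FF FE as "little endian" and does not include the BOM in the result.
    static String decodeHonoringBom(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16);
    }

    // Decodes as UTF-8 no matter what the bytes really are.
    static String decodeForcedUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "aä<" as UTF-16LE with BOM
        byte[] raw = {
            (byte) 0xFF, (byte) 0xFE, 0x61, 0x00, (byte) 0xE4, 0x00, 0x3C, 0x00
        };
        System.out.println("BOM honored, text intact: "
            + decodeHonoringBom(raw).equals("a\u00E4<"));
        System.out.println("forced UTF-8 contains U+FFFD: "
            + (decodeForcedUtf8(raw).indexOf('\uFFFD') >= 0));
    }
}
```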

Answer 2

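Since there has been no valuable answer so far, it seems we need an ugly workaround in order to work with the Ubuntu system clipboard from Java. What a pity. O tempora, o mores. We live in times where Windows handles Unicode encoding better than Ubuntu Linux does.

What we know has already been stated above. We have a correctly encoded text/plain result but a garbled text/html result. And we know how the text/html result gets garbled.

So what we can do is "repair" the wrongly encoded HTML by first replacing all garbled characters with the proper replacement character. Then we can replace the replacement characters with the correct characters taken from the correctly encoded plain text. Of course this can only be done for those parts of the HTML that are visible text, not for attributes, because attribute content is not part of the plain text.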
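The second step of this idea can be sketched on plain strings. This is a simplified illustration, not the workaround itself: it assumes the HTML has already been normalized so that every lost character is a single U+FFFD, it only substitutes U+FFFD (not every non-ASCII character), and it uses a naive '<' / '>' scan to skip tags. The class name PlaceholderFill and its methods are made up for this illustration.

```java
class PlaceholderFill {

    // Collects every non-ASCII char of the plain text, in order of appearance.
    static String nonAscii(String plain) {
        StringBuilder sb = new StringBuilder();
        for (char c : plain.toCharArray()) {
            if (c > 127) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Replaces each U+FFFD outside of tags with the next non-ASCII char
    // from the plain text. Placeholders inside tags (attributes) stay as they are.
    static String fill(String html, String plain) {
        String donors = nonAscii(plain);
        StringBuilder out = new StringBuilder();
        boolean insideTag = false;
        int next = 0;
        for (char c : html.toCharArray()) {
            if (c == '<') {
                insideTag = true;
            } else if (c == '>') {
                insideTag = false;
            }
            if (!insideTag && c == '\uFFFD' && next < donors.length()) {
                out.append(donors.charAt(next++));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "letters: <i>\uFFFD</i>";
        String plain = "letters: \u00E4";
        System.out.println(fill(html, plain).equals("letters: <i>\u00E4</i>"));
    }
}
```

Matching by position only works as long as the visible text of the HTML and the plain text contain the same non-ASCII characters in the same order.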

Workaround:

import java.io.*;

import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.DataFlavor;

import java.nio.charset.Charset;

public class WorkingWithClipboadDataBytesUTF8 {

 static byte[] repairUTF8HTMLDataBytes(byte[] plainDataBytes, byte[] htmlDataBytes) throws Exception {

  //get all the not ASCII characters from plainDataBytes
  //we need them for replacement later
  String plain = new String(plainDataBytes, Charset.forName("UTF-8"));
  char[] chars = plain.toCharArray();
  StringBuffer unicodeChars = new StringBuffer();
  for (int i = 0; i < chars.length; i++) {
   if (chars[i] > 127) unicodeChars.append(chars[i]);
  }
  System.out.println(unicodeChars);

  //omit the first 6 bytes of htmlDataBytes, which are the garbled BOM
  htmlDataBytes = java.util.Arrays.copyOfRange(htmlDataBytes, 6, htmlDataBytes.length);

  //The wrongly decoded bytes which were not replaced by `0xefbfbd`
  //happen to form valid UTF-16LE when taken two bytes at a time.
  //So we "repair" the data accordingly.
  //Goal: all garbage shall become the replacement character 0xFFFD.

  //replace parts of a surrogate pair with 0xFFFD
  //replace the wrong UTF-8 bytes 0xefbfbd for the replacement character with 0xFFFD
  ByteArrayInputStream in = new ByteArrayInputStream(htmlDataBytes);
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  int b = -1;
  int[] btmp = new int[6];
  while ((b = in.read()) != -1) {
   btmp[0] = b;
   btmp[1] = in.read(); //there must always be two bytes because the wrongly decoded data was 16 bit Unicode
   if (btmp[0] != 0xef && btmp[1] != 0xef) { // not a replacement character
    if (btmp[1] > 0xd7 && btmp[1] < 0xe0) { // part of a surrogate pair
     out.write(0xFD); out.write(0xFF);
    } else {
     out.write(btmp[0]); out.write(btmp[1]); //two default bytes
    }
   } else { // at least one must be part of the replacement 0xefbfbd
    btmp[2] = in.read(); btmp[3] = in.read(); //there must be at least two further bytes
    if (btmp[0] != 0xef && btmp[1] == 0xef && btmp[2] == 0xbf && btmp[3] == 0xbd ||
        btmp[0] == 0xef && btmp[1] == 0xbf && btmp[2] == 0xbd && btmp[3] != 0xef) {
     out.write(0xFD); out.write(0xFF);
    } else if (btmp[0] == 0xef && btmp[1] == 0xbf && btmp[2] == 0xbd && btmp[3] == 0xef) {
     btmp[4] = in.read(); btmp[5] = in.read();
     if (btmp[4] == 0xbf &&  btmp[5] == 0xbd) {
      out.write(0xFD); out.write(0xFF);
     } else {
      throw new Exception("Wrong byte sequence: "
      + String.format("%02X%02X%02X%02X%02X%02X", btmp[0], btmp[1], btmp[2], btmp[3], btmp[4], btmp[5]), 
      new Throwable().fillInStackTrace());
     }
    } else {
     throw new Exception("Wrong byte sequence: " 
      + String.format("%02X%02X%02X%02X%02X%02X", btmp[0], btmp[1], btmp[2], btmp[3], btmp[4], btmp[5]),
      new Throwable().fillInStackTrace());
    }
   }
  }

  htmlDataBytes = out.toByteArray();

  //now get this as UTF-16LE (2 bytes for each character, little endian)
  String html = new String(htmlDataBytes, Charset.forName("UTF-16LE"));
  System.out.println(html);

  //replace all of the wrongUnicode with the unicodeChars selected from plainDataBytes
  boolean insideTag = false;
  int unicodeCharCount = 0;
  char[] textChars = html.toCharArray();
  StringBuffer newHTML = new StringBuffer();
  for (int i = 0; i < textChars.length; i++) {
   if (textChars[i] == '<') insideTag = true;
   if (textChars[i] == '>') insideTag = false;
   if (!insideTag && textChars[i] > 127) {
    if (unicodeCharCount >= unicodeChars.length()) 
     throw new Exception("Unicode chars count don't match. " 
      + "We got from plain text " + unicodeChars.length() + " chars. Text until now:\n" + newHTML,
      new Throwable().fillInStackTrace());

    newHTML.append(unicodeChars.charAt(unicodeCharCount++));
   } else {
    newHTML.append(textChars[i]);
   }
  }

  html = newHTML.toString();
  System.out.println(html);

  return html.getBytes("UTF-8");

 }

 static void doSomethingWithUTF8BytesFromClipboard(byte[] plainDataBytes, byte[] htmlDataBytes) throws Exception {

  if (plainDataBytes != null && htmlDataBytes != null) {

   String fileName; 
   OutputStream fileOut;

   fileName = "ResultPlainText.txt";
   fileOut = new FileOutputStream(fileName);
   fileOut.write(plainDataBytes, 0, plainDataBytes.length);
   fileOut.close();

   fileName = "ResultHTMLRaw.txt";
   fileOut = new FileOutputStream(fileName);
   fileOut.write(htmlDataBytes, 0, htmlDataBytes.length);
   fileOut.close();

   //do we have wrong encoded UTF-8 in htmlDataBytes?
   if (htmlDataBytes[0] == (byte)0xef && htmlDataBytes[1] == (byte)0xbf && htmlDataBytes[2] == (byte)0xbd 
    && htmlDataBytes[3] == (byte)0xef && htmlDataBytes[4] == (byte)0xbf && htmlDataBytes[5] == (byte)0xbd) {
    //try repair the UTF-8 HTML data bytes
    htmlDataBytes = repairUTF8HTMLDataBytes(plainDataBytes, htmlDataBytes);
          //do we have additional 0x00 byte at the end?
   } else if (htmlDataBytes[htmlDataBytes.length-1] == (byte)0x00) {
    //do repair this
    htmlDataBytes = java.util.Arrays.copyOf(htmlDataBytes, htmlDataBytes.length-1);
   }

   fileName = "ResultHTML.txt";
   fileOut = new FileOutputStream(fileName);
   fileOut.write(htmlDataBytes, 0, htmlDataBytes.length);
   fileOut.close();

  }

 }

 public static void main(String[] args) throws Exception {

  Clipboard clipboard = Toolkit.getDefaultToolkit().getSystemClipboard();

  byte[] htmlDataBytes = null;
  byte[] plainDataBytes = null;

  for (DataFlavor dataFlavor : clipboard.getAvailableDataFlavors()) {

   String mimeType = dataFlavor.getHumanPresentableName();

   if ("text/html".equalsIgnoreCase(mimeType)) {
    String paramClass = dataFlavor.getParameter("class");
    if ("[B".equals(paramClass)) {
     String paramCharset = dataFlavor.getParameter("charset");
     if (paramCharset != null  && "UTF-8".equalsIgnoreCase(paramCharset)) {

      htmlDataBytes = (byte[])clipboard.getData(dataFlavor);

     }
    } //else if("java.io.InputStream".equals(paramClass)) ...

   } else if ("text/plain".equalsIgnoreCase(mimeType)) {
    String paramClass = dataFlavor.getParameter("class");
    if ("[B".equals(paramClass)) {
     String paramCharset = dataFlavor.getParameter("charset");
     if (paramCharset != null  && "UTF-8".equalsIgnoreCase(paramCharset)) {

      plainDataBytes = (byte[])clipboard.getData(dataFlavor);

     }
    } //else if("java.io.InputStream".equals(paramClass)) ...
   }
  }

  doSomethingWithUTF8BytesFromClipboard(plainDataBytes, htmlDataBytes);

 }

}
