Text changes when copied from a Word document to a web page

Question

I am creating a blog engine that includes a <textarea> which takes the whole article as input.

I then use Ajax to store it in the Text type provided by the GAE datastore.

The problem: if a user copies the text from a Word document, I see various random characters on the screen when it is embedded in the web page. As far as I know, this is because the Word file uses XML encoding while the HTML page uses UTF-8 encoding (in my case).

The question: how do I change the encoding of the inputted text? Or how can I avoid the XML encoding? Or would changing the encoding of my web page help solve this problem?

Points to note: I want this to be automated. I have read on Google that you should first copy the text into a simple text editor, which normalizes the encoding, and then copy it to the web page. That option is not feasible for me.

Also, I have used Weebly before and copied text from a Word file into it at the time, so if anyone knows how Weebly manages the encoding conflict, that would help!

Answers are expected in Java :)

Answer 1

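That happens because Word replaces plain punctuation with typographic ("smart") characters such as curly quotes and apostrophes, dashes, and ellipses. These are valid Unicode, but they are not ASCII, so they show up as garbage whenever some step does not treat the text as UTF-8. One way around it is to replace them programmatically.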

Below is an example in JavaScript:

<textarea rows="4" cols="50" onkeyup="replaceWordChars(this)"></textarea>


// Replaces Word's typographic characters in the given textarea with plain ASCII.
function replaceWordChars(field) {
    var s = field.value;
    // smart single quotes and apostrophe
    s = s.replace(/[\u2018\u2019\u201A]/g, "'");
    // smart double quotes
    s = s.replace(/[\u201C\u201D\u201E]/g, "\"");
    // ellipsis
    s = s.replace(/\u2026/g, "...");
    // en dash and em dash
    s = s.replace(/[\u2013\u2014]/g, "-");
    // circumflex
    s = s.replace(/\u02C6/g, "^");
    // open angle bracket
    s = s.replace(/\u2039/g, "<");
    // close angle bracket
    s = s.replace(/\u203A/g, ">");
    // small tilde and non-breaking space
    s = s.replace(/[\u02DC\u00A0]/g, " ");
    // write back only when something changed, so the caret is not moved needlessly
    if (s !== field.value) {
        field.value = s;
    }
}

On the text area, you need to call this JavaScript function from the onkeyup event.
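
The same substitutions can also be written as a pure function that returns the cleaned string, which makes it easier to test and to reuse from a paste or submit handler. This is only a sketch: the names `normalizeWordChars`, `WORD_CHAR_MAP` and `WORD_CHAR_PATTERN` are hypothetical and not part of the answer above, and the table covers only the characters that answer lists.

```javascript
// Hypothetical lookup table: typographic characters Word inserts -> plain ASCII.
const WORD_CHAR_MAP = {
  "\u2018": "'", "\u2019": "'", "\u201A": "'",   // smart single quotes and apostrophe
  "\u201C": '"', "\u201D": '"', "\u201E": '"',   // smart double quotes
  "\u2026": "...",                               // ellipsis
  "\u2013": "-", "\u2014": "-",                  // en dash, em dash
  "\u02C6": "^",                                 // modifier circumflex
  "\u2039": "<", "\u203A": ">",                  // single angle quotation marks
  "\u02DC": " ", "\u00A0": " "                   // small tilde, non-breaking space
};

// Build one character class from the table keys so the text is scanned once.
const WORD_CHAR_PATTERN = new RegExp(
  "[" + Object.keys(WORD_CHAR_MAP).join("") + "]",
  "g"
);

// Returns a copy of `text` with every mapped character replaced.
function normalizeWordChars(text) {
  return text.replace(WORD_CHAR_PATTERN, (ch) => WORD_CHAR_MAP[ch]);
}
```

For example, `normalizeWordChars("\u201CHello\u201D \u2014 it\u2019s\u2026")` returns `"Hello" - it's...`, while ordinary characters such as `|` are left untouched.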

Answer 2

Not sure if this will help anyone, but I spent a few days trying to figure out this issue. My use case was very similar, except I discovered that my problem was related to the way the clipboard copied (this changed slightly depending on the OS) and subsequently pasted the text. (I used ClipSpy to investigate what was happening "under the hood".)

Forgive my layman's explanation: the clipboard stores text in multiple formats, and when the paste command is given it attempts to match the charset/encoding of the recipient program, or in my case the <textarea> box of my web page. These sites and forum posts helped immensely:

Ultimately, all I had to do was declare <meta charset="UTF-8"> early in the <head> and let the browser do the "hard" work for me: the page then expects UTF-8 encoded text, and the clipboard attempts to honour that.
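
To see why a charset mismatch produces "random characters", the sketch below encodes one of Word's curly apostrophes as UTF-8 and then decodes the same bytes two ways. It illustrates the general effect rather than the exact failure in the question, and it assumes a runtime that provides `TextEncoder` and a `TextDecoder` with windows-1252 support (modern browsers, or Node.js built with full ICU); the variable names are made up for the example.

```javascript
// Encode a right single quotation mark (U+2019), as Word would insert it, into UTF-8.
const bytes = new TextEncoder().encode("\u2019");   // three bytes: 0xE2 0x80 0x99

// Decoding with the matching charset restores the original character.
const asUtf8 = new TextDecoder("utf-8").decode(bytes);

// Decoding with a mismatched single-byte charset yields one character per byte.
const asCp1252 = new TextDecoder("windows-1252").decode(bytes);
```

Here `asUtf8` is the original apostrophe, while `asCp1252` is the three-character sequence `â€™`, the kind of garbage that appears when a page or a storage layer assumes the wrong encoding.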
