HTML5 and the mystery charset

This is my first post at SO so be gentle.

I am currently developing a web app that takes advantage of the new HTML5 FileReader API (target.result), which allows me to read the content of a text file without having to upload anything to the server.

The issue I am having is regarding the charset. Usually, web content is generated via the page itself as a blog post, comment or whatever, so it is compliant with the charset of that page and the database configuration. However, this new HTML5 functionality allows us to get text file content without knowing the original charset or format of the document in question.
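
One common way to cope with an unknown source charset is to read the file as raw bytes and try a strict UTF-8 decode first, falling back to a legacy single-byte encoding if that fails. The following is a minimal sketch, not a general charset detector: the function name decodeBytes and the choice of windows-1252 as the fallback are assumptions made for illustration. In the browser, the bytes would come from reader.readAsArrayBuffer(f) rather than reader.readAsText(f).

```javascript
// Hypothetical helper: decode bytes of unknown charset.
// Assumption: anything that is not valid UTF-8 is treated as windows-1252.
function decodeBytes(bytes) {
    try {
        // fatal: true makes decode() throw on invalid UTF-8 instead of
        // silently inserting U+FFFD (the "diamond question mark").
        return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    } catch (err) {
        return new TextDecoder('windows-1252').decode(bytes);
    }
}

// "it’s" encoded as UTF-8 (the apostrophe is the three bytes E2 80 99)
var utf8Bytes = new Uint8Array([0x69, 0x74, 0xE2, 0x80, 0x99, 0x73]);
// "it’s" encoded as windows-1252 (the apostrophe is the single byte 0x92)
var cp1252Bytes = new Uint8Array([0x69, 0x74, 0x92, 0x73]);

// Both inputs decode to the same string
console.log(decodeBytes(utf8Bytes));
console.log(decodeBytes(cp1252Bytes));
```

The fallback is a guess: a file in some other legacy encoding would still decode without an error, just to the wrong characters.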

It makes sense to encode the data before it is posted via ajax, so I have tried many different methods of converting the text to utf-8, as well as various dataTypes. I have already gone down the whole charset road: htaccess, meta, content-type.

It's no surprise that so many find the whole process of encoding and decoding URIs with different charsets, ASCII, different languages, and ajax dataTypes such a pain.

I think the community could benefit from a solution that can obtain text from any type of text document, regardless of charset or format and in any language, and display it via an ajax request response in its original form, with the added bonus of utf-8. No weird symbols that no one can read, and finally an end to those diamond question marks.

Here is an example of where I am now.

Copy this news article: News Article

...and paste it here: swiss converter tool

No matter what configuration I use, I cannot get the apostrophes to display correctly in the bottom output. Déjà vu, anyone?

So how did Google solve this problem with Google Translate?

EDIT: It's also worth noting that the charsets of both ABC News and the swiss converter tool are utf-8. And you can clearly see that converting from utf-8 to utf-8 also gives the strange symbols, even though they are exactly the same charset.
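
A conversion "from utf-8 to utf-8" can still corrupt text if some step in between decodes the bytes with a different charset. The sketch below reproduces the typical symptom for a typographic apostrophe. It assumes the misbehaving step treats the bytes as windows-1252, which is only a guess about what happens in the tool above.

```javascript
// The right single quotation mark (U+2019) used by many news sites
var apostrophe = '\u2019';

// Step 1: encode as UTF-8, giving the bytes E2 80 99
var bytes = new TextEncoder().encode(apostrophe);

// Step 2 (the assumed mistake): decode those bytes as windows-1252
var garbled = new TextDecoder('windows-1252').decode(bytes);
console.log(garbled); // "â€™"

// Decoding with the matching charset gives the original character back
var restored = new TextDecoder('utf-8').decode(bytes);
console.log(restored === apostrophe); // true
```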

EDIT 2: OK, so I managed to put together a quick prototype and upload it to a remote server. You can access it at babblingo.

This is the javascript that posts the text via ajax:

function handleFileSelect(evt) {
    evt.stopPropagation();
    evt.preventDefault();

    // Files dropped onto the page (drag-and-drop event)
    var files = evt.dataTransfer.files;

    for (var i = 0, f; f = files[i]; i++) {
        var reader = new FileReader();
        reader.onload = (function(theFile) {
            return function(e) {
                // File content as a string, as decoded by readAsText below
                var insertText = e.target.result;
                var fields = 'text=' + insertText;
                $.ajax({
                    type: "POST",
                    url: "ajax.php?action=addfile",
                    data: fields,
                    dataType: "json",
                    complete: function (data) {
                        if (data.responseJSON.message) {
                            $( "#modal-message h4" ).replaceWith( "<h4 class='modal-title text-center'>"+data.responseJSON.message+"</h4>" );
                        }
                        if (data.responseJSON.report) {
                            $( "#report_box" ).replaceWith( '<div id="report_box">'+data.responseJSON.report+'</div>' );
                        }
                        if (data.responseJSON.import) {
                            $('#output_box').replaceWith('<div id="output_box" class="hidden-print">'+data.responseJSON.import+'</div>');
                        }
                        $('#modal-message').modal('show');
                        setTimeout(function() {$('#modal-message').modal('hide');}, 3000);
                    }
                });
            };
        })(f);

        // No encoding argument is passed, so the file is assumed to be UTF-8
        reader.readAsText(f);
    }
}
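
The snippet above concatenates the file text straight into the request body, so characters such as &, % or + in the file would be misread by the server as form syntax. A possible fix, sketched here under the assumption that the server expects an ordinary application/x-www-form-urlencoded field named text, is to percent-encode the value first. The helper name buildFields is made up for this example.

```javascript
// Hypothetical helper: build a safely encoded form body.
function buildFields(insertText) {
    // encodeURIComponent percent-encodes the UTF-8 bytes of every
    // character that has a special meaning in form data.
    return 'text=' + encodeURIComponent(insertText);
}

var body = buildFields('Tom & Jerry\u2019s 100%');
console.log(body);

// The value survives a round trip through a form decoder
var decoded = new URLSearchParams(body).get('text');
console.log(decoded);
```

Passing an object to $.ajax (data: { text: insertText }) has the same effect, since jQuery serializes object data itself. Whether this alone cures the apostrophe problem is not certain; it only rules out one source of corruption.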

Since no one has answered this, I'll venture an answer based upon similar work that I've done with on-the-fly translations for a legacy application that does not understand utf-8 yet generates html.

It simply involved creating a mapping table from each problematic character to its html entity equivalent, ñ => &ntilde; for example. Here's some sample code.

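function createEntities(source) {
    // Map each problematic character to its HTML entity.
    // The <b> tags only highlight the replacements in the output.
    var map = [
        { key: "á", value: "<b>&aacute;</b>" },
        { key: "ñ", value: "<b>&ntilde;</b>" },
        { key: "ó", value: "<b>&oacute;</b>" },
        { key: "'", value: "<b>&apos;</b>" }
    ];
    var target = source;
    for (var i = 0; i < map.length; i++) {
        var pair = map[i];
        // split/join replaces every occurrence; replace() with a string
        // pattern would only replace the first one.
        target = target.split(pair.key).join(pair.value);
    }
    return target;
}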

Here is a jsFiddle demonstrating this. You'll need to set up the appropriate mappings, of course.
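
Maintaining the mapping table by hand gets tedious once more than a few characters are involved. As an alternative sketch (not part of the jsFiddle), every non-ASCII character can be converted to a numeric character reference, which needs no table at all. The function name toNumericEntities is hypothetical.

```javascript
function toNumericEntities(source) {
    // The u flag makes the regex match whole code points, so characters
    // outside the Basic Multilingual Plane are not split into surrogates.
    return source.replace(/[^\x00-\x7F]/gu, function (ch) {
        return '&#' + ch.codePointAt(0) + ';';
    });
}

console.log(toNumericEntities('a\u00F1o'));  // "a&#241;o"
console.log(toNumericEntities('it\u2019s')); // "it&#8217;s"
```

ASCII characters, including the plain ' apostrophe, pass through unchanged, so markup characters such as < and & would still need separate escaping.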
