HTML5 and the mystery charset

This is my first post at SO so be gentle.

I am currently developing a web app that takes advantage of the new HTML5 FileReader API, whose target.result lets me read the content of a text file without having to upload anything to the server.

The issue I am having is with the charset. Usually, web content is created through the page itself, as a blog post, comment or whatever, so it already matches the charset of that page and the database configuration. This new HTML5 functionality, however, lets us read text file content without knowing the original charset or format of the document in question.
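One way to approach the unknown-charset problem, sketched here under assumptions rather than as a known fix: read the file as raw bytes and try a strict UTF-8 decode first, falling back to a legacy single-byte encoding only if the bytes are not valid UTF-8. The helper name decodeBytes and the choice of windows-1252 as the fallback are hypothetical; TextDecoder is the standard Encoding API.

```javascript
// Hypothetical helper: decode raw bytes, preferring UTF-8.
// The windows-1252 fallback is an assumption; pick whichever legacy
// encoding your users' files are most likely to be in.
function decodeBytes(bytes, fallbackEncoding) {
    fallbackEncoding = fallbackEncoding || 'windows-1252';
    try {
        // fatal: true makes decode() throw on invalid UTF-8 instead of
        // silently emitting U+FFFD (the "diamond question mark").
        var text = new TextDecoder('utf-8', { fatal: true }).decode(bytes);
        return { text: text, encoding: 'utf-8' };
    } catch (e) {
        var fallback = new TextDecoder(fallbackEncoding).decode(bytes);
        return { text: fallback, encoding: fallbackEncoding };
    }
}
```

In the browser the bytes would come from reader.readAsArrayBuffer(f), passing new Uint8Array(e.target.result) to the helper. This is only a heuristic: it can tell valid UTF-8 from everything else, but it cannot distinguish one single-byte encoding from another.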

It makes sense to encode the data before it is posted via ajax, so I have tried many different methods of converting the text to utf-8, as well as various dataTypes. I have already gone down the whole charset road: htaccess, meta, content-type.

It's no surprise that so many find the whole process of encoding, decoding URIs using different charsets, ASCII, different languages, and ajax dataTypes such a pain.

I think the community could benefit from a solution that can obtain text from any type of text document, regardless of charset or format and in any language, and display it via an ajax request response in its original form with the added bonus of utf-8: no weird symbols that no one can read, and finally an end to those diamond question marks.

Here is an example of where I am now.

Copy this news article: News Article

...and paste it here: swiss converter tool

No matter what configuration I use, I cannot get the apostrophes to display correctly in the bottom output, deja vu anyone?

So how did Google solve this problem with Google Translate?

EDIT: It's also worth noting that the charsets of both ABC news and the swiss converter tool are utf-8. And you can clearly see that converting from utf-8 to utf-8 also gives the strange symbols, even though they are exactly the same charset.

EDIT 2: Ok, so I managed to put together a quick prototype and upload it to a remote server. You can access it at babblingo

This is the javascript that posts the text via ajax:

function handleFileSelect(evt) {
    evt.stopPropagation();
    evt.preventDefault();

    var files = evt.dataTransfer.files;

    for (var i = 0, f; f = files[i]; i++) {
        var reader = new FileReader();
        reader.onload = (function(theFile) {
            return function(e) {
                var insertText = e.target.result;
                var fields = 'text=' + insertText;
                $.ajax({
                    type: "POST",
                    url: "ajax.php?action=addfile",
                    data: fields,
                    dataType: "json",
                    complete: function (data) {
                        if (data.responseJSON.message) {
                            $( "#modal-message h4" ).replaceWith( "<h4 class='modal-title text-center'>"+data.responseJSON.message+"</h4>" );
                        }
                        if (data.responseJSON.report) {
                            $( "#report_box" ).replaceWith( '<div id="report_box">'+data.responseJSON.report+'</div>' );
                        }
                        if (data.responseJSON.import) {
                            $('#output_box').replaceWith('<div id="output_box" class="hidden-print">'+data.responseJSON.import+'</div>');
                        }
                        $('#modal-message').modal('show');
                        setTimeout(function() {$('#modal-message').modal('hide');}, 3000);
                    }
                });
            };
        })(f);

        reader.readAsText(f);
    }
}
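
A side note on the code above: 'text=' + insertText puts the file content into the request body without escaping it, so characters such as &, % or + in the file would be read by the server as form syntax rather than as text. Whether that explains the broken apostrophes is not established here, since the server side isn't shown, but it is worth ruling out. A minimal sketch, with buildFormBody as a hypothetical helper name:

```javascript
// Hypothetical helper: build an application/x-www-form-urlencoded body
// with every name and value percent-encoded as UTF-8.
function buildFormBody(params) {
    var parts = [];
    for (var name in params) {
        if (Object.prototype.hasOwnProperty.call(params, name)) {
            parts.push(encodeURIComponent(name) + '=' +
                       encodeURIComponent(params[name]));
        }
    }
    return parts.join('&');
}
```

With this, the request would use data: buildFormBody({ text: insertText }). Passing an object instead, as in data: { text: insertText }, should have the same effect, because jQuery serializes and encodes object data itself.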

Since no one has answered this, I'll venture an answer based upon similar work that I've done with on-the-fly translations for a legacy application that does not understand utf-8 yet generates html.

It simply involved creating a mapping table from each problematic character to its html entity equivalent, ñ => &ntilde; for example. Here's some sample code.

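function createEntities(source) {
    // Map each problematic character to its HTML entity.
    var map = [
        { key: "á", value: "<b>&aacute;</b>" },
        { key: "ñ", value: "<b>&ntilde;</b>" },
        { key: "ó", value: "<b>&oacute;</b>" },
        { key: "'", value: "<b>&apos;</b>" }
    ];
    var target = source;
    for (var i = 0; i < map.length; i++) {
        var pair = map[i];
        // split/join replaces every occurrence; replace() with a string
        // pattern would only replace the first one.
        target = target.split(pair.key).join(pair.value);
    }
    return target;
}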

Here is a jsFiddle demonstrating this. You'll need to set up the appropriate mappings, of course.
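
If maintaining the mapping table by hand gets tedious, a different technique from the one above is to skip named entities and emit numeric character references for every non-ASCII character. This is only a sketch, and toNumericEntities is a hypothetical name:

```javascript
// Hypothetical helper: replace every non-ASCII character with a
// numeric character reference such as &#241; for ñ.
function toNumericEntities(source) {
    // The u flag makes the regex match whole code points, so characters
    // outside the BMP (e.g. emoji) become one reference, not two.
    return source.replace(/[^\x00-\x7F]/gu, function (ch) {
        return '&#' + ch.codePointAt(0) + ';';
    });
}
```

The trade-off is that the output is less readable than named entities, but no character can be missing from the table. Plain ASCII characters, including the straight apostrophe, pass through untouched.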

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address. Any questions, please contact: yoyou2525@163.com.

 