How to Encode and Decode “Acute accented characters” using Perl

Question

I am working on a web-based educational website built with Perl, MySQL 5, Apache and Template Toolkit. We are planning to add support for multiple languages to the website.

What we have done so far:

If we have a tab name like <h1>Courses Main Page</h1> in our template file, we have converted it to

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<h1>[% glossary.$language.courses_main_page %]</h1>

where $language holds the language the user selects when logging in.

We have a table to maintain this data in our MySQL DB:

CREATE TABLE translation (
  english varchar(255) NOT NULL, language varchar(255) NOT NULL,
  translation varchar(2000) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Translation of Element text to a foreign language';

In the MySQL connect function, I issue 'SET character_set_results=NULL'. I also tried utf8, but then the problem, which had been limited to some tabs, spread to many sections.

As soon as the user logs in to the system, we fetch all the translations, store them in a Perl hash and cache it. We pass this hash to the template file, which substitutes the values.

Problem: acute-accented characters such as á and é are being replaced with symbols from a different character set.

For ex: in Front end we are seeing "Cursos PÃ¡gina Principal" for Cursos Página Principal.

It is very similar to the solution given in htmlentities and é (e acute)

Can anyone tell me how to achieve the same in Perl?

Answer 1

Denoting the charset

For ex: in Front end we are seeing "Cursos PÃ¡gina Principal" for Cursos Página Principal.

This mojibake happens when the characters are transferred as UTF-8 but interpreted as ISO-8859-1 or similar. So I suggest the easiest way to fix this is making sure that your HTML page gets shipped to the client with a proper MIME type and charset, i.e.

Content-Type: text/html; charset=utf-8

If that information is present in the HTTP header, the value there will override any setting in the HTML document itself. So make sure that either you set the charset in the HTTP header, or that your HTTP header specifies no charset at all, so that the browser will have a look at the meta setting.
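
As a minimal sketch of the first option, assuming a plain CGI-style script that writes the whole response itself (the helper name build_response is hypothetical, and a framework would normally set this header through its own API), the header and a UTF-8 encoded body could be produced like this:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns the complete response as bytes.
# $html is expected to be a Perl character (Unicode) string.
sub build_response {
    my ($html) = @_;
    my $header = "Content-Type: text/html; charset=utf-8\r\n\r\n";
    return $header . encode('UTF-8', $html);
}

binmode STDOUT;    # write raw bytes, no extra encoding layer
print build_response("<h1>Cursos P\N{U+00E1}gina Principal</h1>\n");
```

The point of the sketch is only that the charset named in the header and the encoding actually applied to the body agree.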

In some browsers (Firefox for example) you can manually change the character set using View / Character Encoding. You can use that to check whether a wrong character encoding while rendering really is the cause of the problem.
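
The same diagnosis can be reproduced in Perl. The following sketch (variable names are illustrative only) encodes the intended text as UTF-8 and then deliberately decodes the bytes as ISO-8859-1, which yields the garbled text from the question:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = "Cursos P\N{U+00E1}gina Principal";    # what we mean to show
my $bytes = encode('UTF-8', $text);                # what goes over the wire

# A client that wrongly assumes ISO-8859-1 turns each byte into one character.
my $misread = decode('ISO-8859-1', $bytes);

# \xC3 is "Ã" and \xA1 is "¡" in ISO-8859-1, hence "PÃ¡gina".
binmode STDOUT, ':encoding(UTF-8)';
print "$misread\n";
```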

Actually encoding and decoding

There are some situations where fixing the charset won't help. It might be that you simply don't control that part of your framework. Or that something translates your characters from ISO-8859-1 to UTF-8 twice, so that the unreadable symbols are in fact represented as UTF-8 already. In these cases, you can use the Encode module to encode the characters in Perl directly, using HTML character references as output:

use Encode qw(decode encode FB_HTMLCREF);
# maybe: $unicodeString = decode("utf-8", $byteString);
$htmlString = encode("ascii", $unicodeString, FB_HTMLCREF);

Whether or not the decode step is necessary depends on how you talk to your database. If your database connection is capable of supporting Unicode, then you'll already have Unicode strings, and you can simply encode these to HTML. For DBD::mysql there is a parameter mysql_enable_utf8 => 1 which achieves this. Using it is preferable to decoding things in your own code. This answer has details on the syntax.
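
A rough sketch of how that parameter might be passed to DBI follows. The helper names connect_args and connect_utf8, the database name, host and credentials are all hypothetical placeholders; only the mysql_enable_utf8 attribute comes from the text above.

```perl
use strict;
use warnings;

# Hypothetical helper: builds the argument list for DBI->connect.
sub connect_args {
    my ($db, $user, $password) = @_;
    return (
        "DBI:mysql:database=$db;host=localhost",    # placeholder DSN
        $user,
        $password,
        { RaiseError => 1, mysql_enable_utf8 => 1 },    # treat MySQL strings as UTF-8 text
    );
}

# Connects only when called; requires DBI and DBD::mysql to be installed.
sub connect_utf8 {
    require DBI;
    return DBI->connect(connect_args(@_));
}
```

With a handle obtained this way, fetched strings should already be Perl character strings, so the decode step would be skipped.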

An example of the values these functions work with:

$byteString    = "Cursos P\xc3\xa1gina Principal.";   # two bytes
$unicodeString = "Cursos P\N{U+00E1}gina Principal."; # one unicode character
$htmlString    = "Cursos P&#225;gina Principal.";     # html character reference
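
Putting the two steps together, here is a small runnable sketch that goes from UTF-8 bytes to an ASCII-only HTML string. The helper name bytes_to_html is made up for illustration, and the sketch assumes the input really is singly encoded UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_HTMLCREF LEAVE_SRC);

# Hypothetical helper: UTF-8 bytes in, pure-ASCII HTML out.
sub bytes_to_html {
    my ($byteString) = @_;
    my $unicodeString = decode('UTF-8', $byteString);
    # LEAVE_SRC keeps encode() from modifying $unicodeString in place.
    return encode('ascii', $unicodeString, FB_HTMLCREF | LEAVE_SRC);
}

print bytes_to_html("Cursos P\xc3\xa1gina Principal."), "\n";
# prints: Cursos P&#225;gina Principal.
```

Characters that are already ASCII, including < and &, pass through unchanged, so this is not a substitute for HTML escaping.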

solution1
6 ACCEPTED 2013-03-06 13:12:40