How can I replace all the HTML-encoded accents in Perl?

Question

I have the following situation:

There is a tool that gets an XSLT from a web interface and embeds the XSLT in an XML file (someone should have been fired). "Unfortunately" I work in a French-speaking country, and therefore the XSLT has a number of words with accents. When the XSLT is embedded in the XML, the tool converts all the accents to their HTML entity codes (Iacute, igrave, etc.).

My Perl code retrieves the XSLT from the XML and executes it against another XML file using the Xalan command-line tool. Every time there is an accent in the XSLT, the Xalan tool throws an exception.

I initially thought of using a regexp to change all the accents in the XSLT, such as:

# the & is omitted in the codes because it would be rendered in the page
$xslt =~s/Aacute;/Á/gso;
$xslt =~s/aacute;/á/gso;
$xslt =~s/Agrave;/À/gso;
$xslt =~s/Acirc;/Â/gso;
$xslt =~s/agrave;/à/gso;

but doing so means that I have to write a regexp for each of the accent codes....

My question is: is there any way to do this without writing a regexp per code? (Thinking that this is the only solution makes me want to vomit.)

By the way the tool is TeamSite, and it sucks.....

I forgot to mention that I need a Perl-only solution; security does not let me install any type of libs they have not checked for a week or so :(

Answer 1

You can try something like HTML::Entities. From the POD:

use HTML::Entities;
$a = "V&aring;re norske tegn b&oslash;r &#230res";
decode_entities($a);
#encode_entities($a, "\200-\377");  ## not needed for what you are doing

In response to your edit: HTML::Entities is not in the Perl core. It might still be installed on your system, because it is used by a lot of other libraries. You can check by running this command:

perl -MHTML::Entities -le 'print "If this prints, then it is installed"'
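
The same check can be made from inside the script, so the code only relies on HTML::Entities when it is really there. This is a minimal sketch; `entity_decoder` is a hypothetical helper name, not something provided by the module.

```perl
use strict;
use warnings;

# Returns a code ref that decodes entities, or undef when HTML::Entities
# cannot be loaded. eval traps the exception that require throws on failure.
sub entity_decoder {
    return unless eval { require HTML::Entities; 1 };
    return sub {
        my ($text) = @_;
        # Called in non-void context, decode_entities returns a decoded copy.
        return HTML::Entities::decode_entities($text);
    };
}

my $decode = entity_decoder();
if ($decode) {
    binmode STDOUT, ':encoding(UTF-8)';
    print $decode->('V&aring;re norske tegn b&oslash;r &#230res'), "\n";
}
else {
    print "HTML::Entities is not available; a hand-written table is needed\n";
}
```

Note that decode_entities decodes every entity it knows, including &amp; and &lt;, which may not be what you want if the result still has to be well-formed XML.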

Answer 2

For your purpose, HTML::Entities is by far the best solution, but if you cannot find an existing package that fits your needs, the following approach is more efficient than multiple s/// statements:

# do this part in module-level code outside any function, so it runs only once,
# or place it in a BEGIN block, or run it once before the first s/// that uses it
my %trans = (
  'Aacute;' => 'Á',
  'aacute;' => 'á',
  'Agrave;' => 'À',
  'Acirc;' => 'Â',
  'agrave;' => 'à',
); # remember you can generate parts of this hash for example by map

my $re = qr/${ \(join'|', map quotemeta, keys %trans)}/;

# this code place in your functions or methods
s/($re)/$trans{$1}/g; # 'o' is almost useless here because $re has already been compiled

Edit: There is no need for the e regexp modifier, as pointed out by Chas. Owens.
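
For reference, here is a self-contained sketch of the same table-driven idea that runs as a script. It assumes the entities appear in the text with their leading &, writes the characters as \x{...} escapes so the encoding of the source file does not matter, and covers only a sample of names. `decode_named` and `%char_for` are hypothetical names chosen for this example.

```perl
use strict;
use warnings;

# Entity name => character. Only a sample; extend it with the names you need.
my %char_for = (
    Aacute => "\x{C1}",
    aacute => "\x{E1}",
    Agrave => "\x{C0}",
    agrave => "\x{E0}",
    Acirc  => "\x{C2}",
    eacute => "\x{E9}",
    egrave => "\x{E8}",
    ccedil => "\x{E7}",
);

# Build a single alternation from the table keys and compile it once.
my $names     = join '|', map { quotemeta($_) } keys %char_for;
my $entity_re = qr/&($names);/;

# Replace every entity listed in the table; anything else is left untouched.
sub decode_named {
    my ($text) = @_;
    $text =~ s/$entity_re/$char_for{$1}/g;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print decode_named('R&eacute;sum&eacute; d&eacute;j&agrave; re&ccedil;u &amp; class&eacute;'), "\n";
```

Because &amp; is not in the table, it stays as it is, which keeps the XSLT well-formed after the substitution.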

Answer 3

I don't suppose it's possible to make TeamSite leave it as utf-8/convert it to utf-8?

CGI.pm has an (undocumented) unescapeHTML function. However, since it IS undocumented (and I haven't looked through the source), I don't know if it just handles basic HTML entities (<, >, &) or more. However, I'd GUESS that it only does the basic entities.

Answer 4

Why would anyone be fired for putting XSL (which is XML) into an XML file?

4 answers

solution1
6 ACCEPTED 2009-01-28 14:44:16

solution2
1 2009-01-28 16:23:33

solution3
0 2009-01-28 15:29:35

solution4
0 2009-01-29 11:46:26
