Perl: Convert (high) decimal NCRs to UTF-8

Question

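I have this string (decimal NCRs): &#26085;&#26412;&#12398;&#37756;&#28792;&#12392;&#12399;

It represents the Japanese text 日本の鍼灸とは.

But I need (UTF-8): %E6%97%A5%E6%9C%AC%E3%81%AE%E9%8D%BC%E7%81%B8%E3%81%A8%E3%81%AF

For the first character: &#26085; ⇒ 日 ⇒ %E6%97%A5

This site does it, but how do I get this in Perl? (If possible in a single regex, like s/\&\#([0-9]+);/uc('%'.unpack("H2", pack("c", $1)))/eg; )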

http://www.endmemo.com/unicode/unicodeconverter.php

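I also need to convert it back from UTF-8 to decimal NCRs.

I have been racking my brain over this for half a day now, so any help is much appreciated!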

Answer 1

What you call "UTF-8" is actually URL encoding (percent-encoding of the UTF-8 bytes).
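To make the distinction concrete, here is a minimal sketch that prints the UTF-8 bytes of a single code point as hex digits. It uses only the core Encode module; the helper name utf8_hex is hypothetical and not part of any library. It also shows why a single pack("c", ...) cannot work here: one code point above 127 becomes several bytes, and each byte gets its own %XX escape.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: hex dump of the UTF-8 bytes of one code point.
sub utf8_hex {
    my ($code_point) = @_;
    my $bytes = encode('UTF-8', chr $code_point);   # character => UTF-8 bytes
    return uc(unpack('H*', $bytes));                # bytes => hex digits
}

print utf8_hex(26085), "\n";   # prints E697A5, i.e. %E6%97%A5 once each byte is prefixed with %
```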

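HTML entity (&#26085;) ⇒ text (日) ⇒ URI component (%E6%97%A5):

use HTML::Entities qw( decode_entities );
use URI::Escape    qw( uri_escape_utf8 );

my $html = '&#26085;';                         # example input from the question
my $text = decode_entities($html);             # NCRs => characters
my $uri_component = uri_escape_utf8($text);    # characters => UTF-8 bytes => %XX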

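URI component (%E6%97%A5) ⇒ text (日) ⇒ HTML entity (&#x65E5;, the hexadecimal form of &#26085;):

use Encode         qw( decode_utf8 );
use HTML::Entities qw( encode_entities );
use URI::Escape    qw( uri_unescape );

my $uri_component = '%E6%97%A5';                          # example input from the question
my $text = decode_utf8(uri_unescape($uri_component));     # %XX => bytes => characters
my $html = encode_entities($text);                        # writes numeric references in hex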
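One caveat: encode_entities writes numeric character references in hexadecimal (&#x65E5;), not in the decimal form the question uses. If decimal NCRs are required, one option is a small substitution built on the core Encode module. This is a minimal sketch, not part of the answer above. The helper name uri_to_decimal_ncr is hypothetical, and it assumes that ASCII characters (including & and <) should pass through unchanged.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: percent-encoded UTF-8 => decimal NCRs.
# Assumption: ASCII characters are left as they are.
sub uri_to_decimal_ncr {
    my ($uri_component) = @_;

    # %XX escapes => raw bytes
    (my $bytes = $uri_component) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;

    # UTF-8 bytes => characters
    my $text = decode('UTF-8', $bytes);

    # every non-ASCII character => &#NNNNN;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

print uri_to_decimal_ncr('%E6%97%A5%E6%9C%AC'), "\n";   # prints &#26085;&#26412;
```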

Answer 2

#!/usr/bin/perl
use strict;
use warnings;

use Test::More tests => 2;
use Encode qw{ encode decode };

my $in = '&#26085;&#26412;&#12398;&#37756;&#28792;&#12392;&#12399;'; # 日本の鍼灸とは
my $out = '%E6%97%A5%E6%9C%AC%E3%81%AE%E9%8D%BC%E7%81%B8%E3%81%A8%E3%81%AF';

# Decimal NCRs => characters.
(my $utf = $in) =~ s/&#([0-9]+);/chr $1/ge;

# Characters => UTF-8 bytes => one zero-padded %XX escape per byte.
my $r = join q(), map { sprintf '%%%02X', ord($_) } split //, encode('UTF-8', $utf);
is($r, $out);

# %XX escapes => bytes => characters => decimal NCRs.
(my $s = $r) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
$s = decode('UTF-8', $s);
$s = join q(), map '&#' . ord($_) . ';', split //, $s;
is($s, $in);
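The question asked whether the forward direction fits into a single regex. Under the same approach it can be collapsed into one substitution, roughly as sketched below. The helper name ncr_to_uri is hypothetical. Unlike a full URL escaper, this sketch only rewrites the NCRs and leaves any other text in the string untouched.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: decimal NCRs => percent-encoded UTF-8 in one s///.
sub ncr_to_uri {
    my ($str) = @_;
    $str =~ s{&#([0-9]+);}{
        # code point => character => UTF-8 bytes => %XX per byte
        join '', map { sprintf '%%%02X', ord($_) } split //, encode('UTF-8', chr $1)
    }ge;
    return $str;
}

print ncr_to_uri('&#26085;'), "\n";   # prints %E6%97%A5
```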

Answer 1: score 3, 2015-03-19 13:28:06
Answer 2: score 0, accepted, 2015-03-19 13:19:52
