
How to match Chinese characters using Perl's regex

I need to match some Chinese characters in a UTF-8 encoded HTML page, and I wrote the test code below:

#! /usr/bin/perl

use strict;
use LWP::UserAgent;
use Encode;

my $ua = new LWP::UserAgent;

my $request = HTTP::Request->new('GET');
my $url = 'http://www.boc.cn/sourcedb/whpj/';
$request->url($url);

my $res = $ua->request($request) ;

my $str_chinese =   encode("utf8" ,"英磅" ) ;  
# my $str_chinese = "英磅" ;


my $str_english = "English" ;
#my $html = decode("utf8" , $res->content) ;
my $html = $res->content ; 

if ( $html =~ /$str_chinese/ ) {
     print "chinese word matched" ;
}else {
     print "chinese word unmatched\n" ;
}

if ( $html =~ /$str_english/i ) {
    print "english word matched\n" ;
}else {
    print "english word unmatched\n" ;
}

The output shows that the script fails to match the Chinese characters that are present in the HTML. Could you give me some hints on how to solve my problem?

Since you have put UTF-8 characters in the source code, you have to add:

use utf8;

It tells Perl that your script is written in UTF-8, so string literals are read as characters rather than as raw bytes.
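To illustrate what that pragma changes, here is a minimal sketch that needs no network access. The sample markup and the variable names are made up for the example; only `use utf8` and the `Encode` functions come from the discussion above. It compares a character string with its UTF-8 byte encoding and shows that a pattern only matches text held in the same form.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                          # literals below are read as characters
use Encode qw(encode decode);

my $chars = '首页';                          # 2 characters under "use utf8"
my $bytes = encode('UTF-8', $chars);         # the same text as 6 UTF-8 bytes

# Hypothetical stand-in for raw bytes fetched from a server.
my $raw_html  = encode('UTF-8', '<a>首页</a>');
# The same page after decoding the bytes into characters.
my $text_html = decode('UTF-8', $raw_html);

printf "characters: %d, bytes: %d\n", length($chars), length($bytes);
print "bytes pattern matches raw bytes\n"      if $raw_html  =~ /\Q$bytes\E/;
print "character pattern matches decoded text\n" if $text_html =~ /\Q$chars\E/;
print "character pattern does not match raw bytes\n"
    unless $raw_html =~ /\Q$chars\E/;
```

The point of the sketch is that both sides of the match should be in the same form: either decode the page and use character literals, or keep everything as bytes.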

I ran your code and the Chinese characters were not matched.

Then I checked the HTML, and it does not contain these characters, so that may be the reason nothing matched. I then tried another character (联) and also removed the encode call, i.e. my $str_chinese = "联";

With this change the character is matched.
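One plausible explanation for why dropping encode helps, sketched under assumptions: without `use utf8`, the literal already reaches Perl as UTF-8 bytes, and calling encode on it encodes those bytes a second time. The bytes are written as escapes below so the example does not depend on the file's encoding, and `$raw_html` is a hypothetical stand-in for what `$res->content` returns.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(encode);

# Without "use utf8", the literal "联" is seen as its three UTF-8 bytes.
my $literal = "\xE8\x81\x94";

# Hypothetical stand-in for the raw bytes returned by $res->content.
my $raw_html = "<a>\xE8\x81\x94</a>";

# encode() treats each byte as a Latin-1 character and encodes it again.
my $double = encode('UTF-8', $literal);

printf "literal: %d bytes, after encode: %d bytes\n",
    length($literal), length($double);
print "plain literal matches raw bytes\n"
    if $raw_html =~ /\Q$literal\E/;
print "double-encoded literal does not match\n"
    unless $raw_html =~ /\Q$double\E/;
```

Under these assumptions the unencoded literal matches because both sides are raw bytes, while the double-encoded one is twice as long and matches nothing on the page.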

You should use the decoded_content method from the HTTP::Message class instead. It decodes the response body into characters for you, so manual decoding is not necessary.

#!/usr/bin/env perl
use utf8;
use strict;
use LWP::UserAgent;

my $html = LWP::UserAgent->new
    ->get('http://www.boc.cn/sourcedb/whpj/')
    ->decoded_content;

my $str_chinese = '首页';
my $str_english = 'English';

if ($html =~ /$str_chinese/) {
    print "chinese word matched\n";
} else {
    print "chinese word unmatched\n";
}

if ($html =~ /$str_english/i) {
    print "english word matched\n";
} else {
    print "english word unmatched\n";
}

Output:

chinese word matched
english word matched
