
Sed not working with special characters

I have a sample text file in which some numbers are encoded as non-ASCII characters. I have the character map that was used to encode the file, but when I use sed to replace each of these characters, I get unexpected results.

Like these:

 ¤»¤ ¡  1 3

3ô1ô ôôôôô1ô
ôôôô
                       ôôôôô¤ôôôôô»ôôôôô¤ôôôôôô ô¡ ô 1 3ô

The commands I have tried are:

sed -r 's/`echo ô`/5/g' new.txt
sed -r 's/\ô/5/g' new.txt

I also tried Perl:

perl -pe 's/\ô/5/g' < new.txt

I would appreciate any help with this. Thanks.

I think the way to solve this is to first get the characters (in both files) into an unambiguous form. Then iterate through the mapping file, adding each unambiguous character to a hash as a key, with its mapped value as the hash value. Finally, loop through the unambiguous sample characters (each one is 16 characters long in its unambiguous form), replacing each with its value from the hash. This will break if the sample file contains ASCII characters (i.e. characters whose unambiguous form is not 16 characters long). You may need to handle this depending on your input, but if your sample text is representative of your actual file, you shouldn't have any problems. Please let me know if the results are not what you were expecting.
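To make "unambiguous form" concrete: sed's `l` command prints every non-printable byte as a backslash followed by three octal digits and marks the end of each line with `$`. The sketch below is only an illustration. It assumes the character is `ô` encoded as UTF-8 (the two bytes 303 and 264 in octal), which may not match how your file is actually encoded, and the helper name `to_unambiguous` is made up for this example. It uses plain `l` rather than `l0`; as far as I know the numeric argument that controls line wrapping is a GNU extension.

```shell
#!/bin/sh
# Hypothetical helper: print stdin in sed's unambiguous form.
# LC_ALL=C makes sed work byte by byte, so every byte >= 0x80 is
# shown as a 4-character octal escape such as \303.
to_unambiguous() {
    LC_ALL=C sed -n 'l'
}

# Assumption: "ô" encoded as UTF-8 is the two bytes \303 \264.
# The input line is: 3 ô 1
printf '3\303\2641\n' | to_unambiguous
# prints: 3\303\2641$
```

Each escaped byte takes four characters, so a character whose unambiguous form is 16 characters long corresponds to four bytes in the file.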

Run like:

./translate.pl CharMap.txt sample.txt

Contents of translate.pl:

#!/usr/bin/perl
use strict;
use warnings;

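# open the files up for reading; this only checks that both files are
# readable, because the data itself is read through sed below.
# ARGV[0] points to the first file listed, 'CharMap.txt'
# ARGV[1] points to the second file listed, 'sample.txt'
open CHARMAP, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
open SAMPLE, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";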

# execute `sed -n 'l0'` on each file and capture output into two arrays
# the '-n' flag suppresses printing of pattern space
# the 'l0' command simply means print the pattern space in an unambiguous form
my @charmap = `sed -n 'l0' $ARGV[0]`;
my @sample = `sed -n 'l0' $ARGV[1]`;

# declare a hash
my %charhash;

# loop through the array of character mappings
for (@charmap) {
    # use a subroutine to sanitize each element
    $_ = sanitize($_);
    # add each unambiguous character to a hash with its mapping pair
    $charhash{ substr $_, 2 } = substr $_, 0, 1;
}

# now loop through the unambiguous sample data
# in your sample file there is only a single element so the loop is unnecessary
for (@sample) {
    # use a subroutine to sanitize each element
    $_ = sanitize($_);
    # each unambiguous character is 16 readable characters long,
    # so we need to loop through 16 chars at a time. Each chunk is stored in $1.
    # then we ask the hash 'what is the value of the element $1?'
    # we then print this value.
    print $charhash{$1} while $_ =~ /(.{16})/g;

    # print a newline char to replace the chomped input
    print "\n";
}

close CHARMAP;
close SAMPLE;

sub sanitize {

    # read in the element passed to the subroutine
    my $line = shift;

    # remove newline endings
    chomp $line;

    # your files start with these three bytes (12 characters in unambiguous
    # form). They are the UTF-8 byte order mark (BOM), which is invisible in
    # most editors. For convenience, I simply remove it from every line,
    # even though I only found it on the first line.
    $line =~ s/^\\357\\273\\277//;

    # trim off the trailing '$' that sed's 'l' command adds to mark end of line
    $line =~ s/\$$//;

    # trim off a trailing '\r' (carriage return from Windows line endings)
    $line =~ s/\\r$//;

    return $line;
}
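
For comparison, the three cleanup steps in the sanitize subroutine can also be written as a small sed pipeline. This is a minimal sketch, not a replacement for the script: the function name `sanitize_unambiguous` is made up for this example, and the test line (a UTF-8 byte order mark, the text `abc`, and a CRLF line ending) is an assumed input chosen to exercise all three substitutions.

```shell
#!/bin/sh
# Hypothetical shell counterpart of the sanitize() subroutine.
# Input: lines already in sed's unambiguous form (from `sed -n l`).
sanitize_unambiguous() {
    # 1. drop the escaped UTF-8 byte order mark at the start of a line
    # 2. drop the '$' that `l` appends to mark the end of the line
    # 3. drop an escaped carriage return left over from CRLF endings
    sed -e 's/^\\357\\273\\277//' \
        -e 's/\$$//' \
        -e 's/\\r$//'
}

# Assumed input: BOM + "abc" + CRLF. In unambiguous form this line
# is \357\273\277abc\r$ before sanitizing.
printf '\357\273\277abc\r\n' | LC_ALL=C sed -n 'l' | sanitize_unambiguous
# prints: abc
```

The order matters: the `$` marker has to be removed before the `\r` pattern can match at the end of the line.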

Result:

3177191281013,997,094

More information about sed's l0 command can be found in the sed manual.
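
A side note on the first sed attempt in the question: inside single quotes the shell does not run the backticks, so sed receives the literal text `echo ô` as its pattern. If the file really contains the character as typed, a direct byte-wise replacement may be enough. The sketch below is a hedged illustration only: it assumes `ô` is stored as the UTF-8 bytes 303 264 (octal), and `replace_char` is a made-up helper name. Given the BOM and the 16-character sequences seen above, your file may well be encoded differently, in which case the bytes must be adjusted.

```shell
#!/bin/sh
# Hypothetical helper: replace every occurrence of one character
# (given as raw bytes in $1) with the replacement text in $2.
replace_char() {
    # Double quotes let the shell expand $1 and $2 before sed runs.
    # LC_ALL=C makes sed compare raw bytes, independent of the locale.
    LC_ALL=C sed "s/$1/$2/g"
}

# Assumption: the character to replace is UTF-8 "ô" (bytes \303 \264).
o_circ=$(printf '\303\264')

# The input line is: 3 ô 1 ô
printf '3\303\2641\303\264\n' | replace_char "$o_circ" 5
# prints: 3515
```

This simple version does not escape regex metacharacters or `/` in its arguments, so it only suits plain characters like the ones in the character map.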


 