Sed not working with special characters

Question

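I have a sample text file in which some digits are encoded as non-ASCII characters. I have the character map that was used to encode the file, but when I use sed to replace each of these characters, I get unexpected results.

Like these: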

 ¤»¤ ¡  1 3

3ô1ô ôôôôô1ô
ôôôô
                       ôôôôô¤ôôôôô»ôôôôô¤ôôôôôô ô¡ ô 1 3ô

The commands I have tried are these:

sed -r 's/`echo ô`/5/g' new.txt
sed -r 's/\ô/5/g' new.txt

and also

perl -pe 's/\ô/5/g' < new.txt

I need help with this. Thanks.

Answer 1

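I think the way to solve this is to first get the characters in an unambiguous form (in both files). Then loop through the mapping file, adding each unambiguous character to a hash together with its value. Finally, loop through the unambiguous sample characters (each unambiguous character is 16 characters long) and replace each one with its hash value. This will break if the sample file contains ASCII characters (i.e. characters whose unambiguous form is not 16 characters long). You may need to work around that depending on your input, but if the sample text is representative of the actual file, you should not have any problems. Let me know if the results are not what you expect.

Run it like: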
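To see what "unambiguous form" means here, the following minimal sketch feeds one short line through sed's `l` command. The function name `list_bytes` is made up for this illustration, and `LC_ALL=C` is an assumption added so that sed treats the input as raw bytes regardless of the terminal's locale. The sample uses the two-byte UTF-8 encoding of ô (0xC3 0xB4), written as octal escapes so the example does not depend on how this page is encoded.

```shell
#!/bin/sh
# list_bytes is a hypothetical helper name, not part of the original answer.
# It prints stdin in sed's unambiguous listing form: non-printable bytes
# become \ooo octal escapes and every line ends with a '$' marker.
list_bytes() {
    LC_ALL=C sed -n 'l'
}

# '3', then the bytes 0xC3 0xB4 (UTF-8 for ô), then '1'
printf '3\303\2641\n' | list_bytes
# prints: 3\303\2641$
```

Each non-ASCII byte becomes a four-character escape, which is why one encoded character has a fixed width in the listing. Plain `l` wraps long lines; the `l0` used in the script is the GNU sed form that turns wrapping off.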

./translate.pl CharMap.txt sample.txt

Contents of translate.pl:

#!/usr/bin/perl
use strict;
use warnings;

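# open the files to check that they exist and are readable.
# the handles themselves are never read from; sed reads the files below.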
# ARGV[0] points to the first file listed, 'CharMap.txt'
# ARGV[1] points to the second file listed, 'sample.txt'
open CHARMAP, $ARGV[0] or die;
open SAMPLE, $ARGV[1] or die;

# execute `sed -n 'l0'` on each file and capture output into two arrays
# the '-n' flag suppresses printing of pattern space
# the 'l0' command simply means print the pattern space in an unambiguous form
my @charmap = `sed -n 'l0' $ARGV[0]`;
my @sample = `sed -n 'l0' $ARGV[1]`;

# declare a hash
my %charhash;

# loop through the array of character mappings
for (@charmap) {
    # use a subroutine to sanitize each element
    $_ = sanitize($_);
    # add each unambiguous character to a hash with its mapping pair
    $charhash{ substr $_, 2 } = substr $_, 0, 1;
}

# now loop through the unambiguous sample data
# in your sample file there is only a single element so the loop is unnecessary
for (@sample) {
    # use a subroutine to sanitize each element
    $_ = sanitize($_);
    # each unambiguous character is 16 readable characters long,
    # so we walk through the line 16 characters at a time, capturing each chunk in $1.
    # then we ask the hash "what is the value of the element $1?"
    # and print that value.
    print $charhash{$1} while $_ =~ /(.{16})/g;

    # print a newline char to replace the chomped input
    print "\n";
}

close CHARMAP;
close SAMPLE;

sub sanitize {

    # read in the element passed to the subroutine
    my $line = shift;

    # remove newline endings
    chomp $line;

    # both files start with the UTF-8 byte order mark (bytes EF BB BF), which
    # 'l0' prints as the 12 characters \357\273\277. It is invisible in most
    # editors. For convenience, remove it from every line, even though it was
    # only found on the first line.
    $line =~ s/^\\357\\273\\277//;

    # trim off the trailing '$' that 'l0' appends to mark the end of each line
    $line =~ s/\$$//;

    # trim off a trailing carriage return, which 'l0' prints as the two characters \r
    $line =~ s/\\r$//;

    return $line;
}

Result:

3177191281013,997,094

More information about sed's l0 can be found in the sed manual.
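The clean-up that the script's `sanitize` subroutine applies to each listed line can also be tried on its own. The sketch below is a shell rendering of those same three substitutions, not part of the original answer. The name `sanitize_listing` is hypothetical, and the test input (a byte order mark, a short payload and a Windows line ending) is invented for the demonstration.

```shell
#!/bin/sh
# sanitize_listing is a hypothetical name; it mirrors the three substitutions
# in the Perl sanitize subroutine, applied to sed's listing output.
sanitize_listing() {
    # 1. drop a leading UTF-8 byte order mark, listed as \357\273\277
    # 2. drop the trailing '$' end-of-line marker added by 'l'
    # 3. drop a trailing carriage return, listed as the two characters \r
    sed -e 's/^\\357\\273\\277//' -e 's/\$$//' -e 's/\\r$//'
}

# Invented input: BOM, '3', bytes 0xC3 0xB4, '1', then CR LF
printf '\357\273\2773\303\2641\r\n' | LC_ALL=C sed -n 'l' | sanitize_listing
# prints: 3\303\2641
```

The order of the expressions matters: the `$` marker has to be removed before the `\r` pattern can match at the end of the line.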

Solution 1 (accepted, 2012-09-18 04:14:36)
