
How do I reverse escape backslash encodings like “\ ” and “\303\266” in bash?

I have a script that records files with UTF-8 encoded names. However, the script's encoding / environment wasn't set up right, and it just recorded the raw bytes. I now have lots of lines in the file like this:

.../My\ Folders/My\ r\303\266m/...

So the spaces in the filenames are escaped as "\ ", and the UTF-8 bytes appear as octal escapes like \303\266 (which is ö). How do I reverse this encoding? Is there some easy set of bash commands I can chain together to remove them?

I could write a huge number of sed commands, but it would take ages to list all the non-ASCII characters we have. Or I could start parsing it in Python. But I'm hoping there's some trick I can use.

Here's a rough stab at the Unicode characters:

text="/My\ Folders/My\ r\303\266m/"
text="echo \$\'"$(echo "$text"|sed -e 's|\\|\\\\|g')"\'"
# the argument to the echo must not be quoted or escaped-quoted in the next step
text=$(eval "echo $(eval "$text")")
read text < <(echo "$text")
echo "$text"

This makes use of the $'string' quoting feature of Bash.

This outputs "/My Folders/My röm/".

As of Bash 4.4, it's as easy as:

text="/My Folders/My r\303\266m/"
echo "${text@E}"

This uses a new feature of Bash called parameter transformation. The E operator causes the parameter to be treated as if its contents were inside $'string', in which backslash-escaped sequences (in this case octal values) are evaluated.

It is not clear exactly what kind of escaping is being used. The octal character codes are C, but C does not escape space. The space escape is used in the shell, but it does not use octal character escapes.

Something close to C-style escaping can be undone using the command printf %b "$escaped". (The documentation says that octal escapes start with \0, but that does not seem to be required by GNU printf.) Another answer mentions read for unescaping shell escapes, although if space is the only one that is not handled by printf %b, then handling that case with sed would probably be better.
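The combination described above could be sketched roughly like this. The helper name unescape_line is made up for illustration, and the extra sed expression that rewrites \NNN as \0NNN is an assumption added for shells whose printf insists on the documented \0 form:

```shell
#!/bin/sh
# unescape_line is a hypothetical helper name, not part of the original answer.
unescape_line() {
    # 1. sed turns "\ " into a plain space.
    # 2. sed rewrites \NNN as \0NNN, the octal form documented for %b.
    # 3. printf %b expands the octal escapes into raw bytes.
    printf '%b\n' "$(printf '%s\n' "$1" |
        sed -e 's/\\ / /g' -e 's/\\\([0-7]\{1,3\}\)/\\0\1/g')"
}

unescape_line '/My\ Folders/My\ r\303\266m/'
```

On a UTF-8 terminal this should print the path with plain spaces and ö. A path containing a literal backslash followed by digits would be misread, so treat this as a starting point only.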

In the end I used something like this:

cat file | sed 's/%/%%/g' | while read -r line ; do printf "${line}\n" ; done | sed 's/\\ / /g'

Some of the files had % in them, which is a printf special character, so I had to 'double it up' so that it would be escaped and passed straight through. The -r option stops read from interpreting the backslashes itself, which also means read doesn't turn "\ " into " ", so I needed the final sed.
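A possible variation, assuming the same input format: if each line is passed to printf as an argument to %b instead of being used as the format string, the % doubling should not be needed. The function name unescape_file and the sample file contents are hypothetical:

```shell
#!/bin/sh
# unescape_file is a hypothetical helper name; pass it the path of the list file.
unescape_file() {
    # sed first: "\ " becomes a space and \NNN becomes \0NNN (the octal
    # form documented for %b). printf %b then expands the escapes.
    sed -e 's/\\ / /g' -e 's/\\\([0-7]\{1,3\}\)/\\0\1/g' "$1" |
    while IFS= read -r line; do
        # The line is an argument to %b rather than the format string,
        # so a literal % passes through without doubling.
        printf '%b\n' "$line"
    done
}

# Example run on a throwaway file (contents invented for the demonstration).
tmp=$(mktemp)
printf '%s\n' '/My\ Folders/100%\ r\303\266m/' > "$tmp"
unescape_file "$tmp"
rm -f "$tmp"
```

Doing the space replacement before printf also avoids handing "\ " to %b, whose treatment of unknown escapes may differ between shells.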

Use printf to solve the issue with UTF-8 text. Use read to take care of spaces (\ ).

Like this:

$ text='/My\ Folders/My\ r\303\266m/'
$ IFS='' read t < <(printf "$text")
$ echo "$t"
/My Folders/My röm/

The built-in 'read' function will handle part of the problem:

$ echo "with\ spaces" | while read r; do echo $r; done
with spaces
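One caveat worth illustrating: without -r, read drops every backslash, not only the ones in front of spaces, so the octal escapes have to be expanded before read sees them. A small demonstration, with strip_with_read as a made-up helper name:

```shell
#!/bin/sh
# strip_with_read is a hypothetical helper used only for this demonstration.
strip_with_read() {
    # No -r: read treats each backslash as an escape for the next character
    # and drops the backslash itself. IFS= keeps surrounding whitespace.
    printf '%s\n' "$1" | {
        IFS= read line
        printf '%s\n' "$line"
    }
}

strip_with_read 'with\ spaces'      # prints: with spaces
strip_with_read 'My\ r\303\266m'    # prints: My r303266m
```

The second call shows why running read on the raw line would destroy the \303\266 sequence instead of decoding it.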

Pass the file to the following Perl script, which processes it line by line.

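#!/usr/bin/perl
use strict;
use warnings;

# Print one line, handling each backslash-octal escape (e.g. \303).
sub encode {
    my ($String) = @_;
    while ($String =~ /(\\[0-9]+|.)/g) {
        my $Match = $1;

        if ($Match =~ /\\([0-9]+)/) {
            my $Code = oct($1);
            # Codes 32..159 are printed as characters; anything else is
            # written as a \x{..} escape. To emit the raw byte instead
            # (so that \303\266 becomes the UTF-8 bytes of ö), print
            # chr($Code) unconditionally.
            my $Char = (($Code >= 32) && ($Code < 160))
                ? chr($Code)
                : sprintf("\\x{%X}", $Code);
            print $Char;
        } else {
            print $Match;
        }
    }

    print "\n";
}

# Process every file named on the command line, line by line.
while (@ARGV) {
    my $File = shift;
    open(my $F, '<', $File) or die "Cannot open $File: $!";
    while (my $String = <$F>) {
        $String =~ s/\\ / /g;    # replace "\ " with " "
        encode($String);
    }
    close($F);
}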

Like this:

$ ./PerlEncode.pl Test.txt

Where Test.txt contains:

/My\ Folders/My\ r\303\266m/
/My\ Folders/My\ r\303\266m/
/My\ Folders/My\ r\303\266m/

The line "$String =~ s/\\ / /g;" replaces "\ " with " ", and sub encode parses the octal-escaped Unicode characters.

Hope this helps.
