Trying to remove non-printable characters (junk values) from a UNIX file

I am trying to remove non-printable characters (for example ^@) from the records in my file. Because the file contains a very large number of records, looping over the output of cat is not an option: the loop takes too much time. I tried using

sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILENAME

but the ^@ characters are still not removed. I also tried using

awk '{ sub("[^a-zA-Z0-9\"!@#$%^&*|_\[](){}", ""); print }' FILENAME > NEWFILE

but that did not help either.

Can anybody suggest an alternative way to remove non-printable characters?

I also used tr -cd, but it removes accented characters, and those are required in the file.

Perhaps you could delete the complement of [:print:], the class that contains all printable characters:

tr -cd '[:print:]' < file > newfile
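
One caveat with this command: newline and tab are not members of [:print:], so it also deletes the line breaks and joins every record into one long line. A minimal sketch that keeps them, assuming the records are newline-terminated (the function name strip_nonprint is only illustrative):

```shell
#!/bin/sh
# Delete every byte that is neither printable, a newline, nor a tab.
# Reads stdin, writes stdout.
strip_nonprint() {
    tr -cd '[:print:]\n\t'
}

# Demo: a NUL byte (\000) and an ESC byte (\033) are dropped,
# while the tab and both line breaks survive.
printf 'a\000b\tc\nd\033e\n' | strip_nonprint
```

Usage would be strip_nonprint < file > newfile. As the question notes, a byte-oriented tr may still drop accented characters with this approach.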

If your version of tr doesn't support multi-byte characters (it seems that many don't), this works for me with GNU sed (with UTF-8 locale settings):

sed 's/[^[:print:]]//g' file

Remove all control characters first:

tr -dc '\007-\011\012-\015\040-\376' < file > newfile
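
Read as octal byte values, the set keeps \007-\015 (bell through carriage return, which includes tab and newline) and \040-\376 (space through byte 254, which includes DEL), and deletes everything else, including NUL. Every byte of a multi-byte UTF-8 character falls in \200-\376, so accented characters should pass through intact when tr works on single bytes. A small sketch to check this, assuming UTF-8 input and forcing the C locale so that tr treats the input as raw bytes (the function name strip_ctrl_bytes is hypothetical):

```shell
#!/bin/sh
# Keep bytes 007-015 and 040-376 (octal); delete all others.
# LC_ALL=C makes tr operate on raw bytes regardless of the user's locale.
strip_ctrl_bytes() {
    LC_ALL=C tr -dc '\007-\011\012-\015\040-\376'
}

# Demo input: "café" in UTF-8 (é is the two bytes \303\251),
# followed by a NUL byte and a \001 control byte.
printf 'caf\303\251\000\001 ok\n' | strip_ctrl_bytes
```

The NUL and \001 bytes are removed, while both bytes of the accented character are kept.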

Then try your string:

sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' newfile

I believe that the ^@ you see is in fact a NUL byte (\0).
The tr filter above will remove those as well.
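
To check that guess before and after filtering, one option is to count the NUL bytes directly. A minimal sketch (the helper name count_nul is made up for illustration):

```shell
#!/bin/sh
# Count NUL bytes on stdin: delete everything except \000, then count bytes.
count_nul() {
    n=$(LC_ALL=C tr -cd '\000' | wc -c)
    # Arithmetic expansion strips the padding some wc versions add.
    echo $((n))
}

# Demo: the input below contains exactly two NUL bytes.
printf 'a\000b\000c\n' | count_nul
```

Running count_nul < file before filtering and count_nul < newfile afterwards should show the count dropping to 0 if the ^@ characters really are NUL bytes.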

strings -1 file... > outputfile

seems to work. The strings program finds every run of printable characters, in this case of length 1 or more (the -1 argument), and prints it. It effectively removes all the non-printable characters. Note that each run is printed on its own line, so a line break appears wherever a non-printable character was removed.

"man strings" will provide the documentation.

I was searching for this for a while and found a rather simple solution:

The package ansifilter does exactly this. All you need to do is pipe the output through it.

On Mac:

brew install ansifilter

Then:

cat file.txt | ansifilter
