Remove a range of escape / non-printable chars in sed
I am working with horrible text data (a 2 GB CSV file) that has practically every control character in the range 0x00-0x1F scattered throughout it. I tried to read it into R for processing but cannot because of the EOFs (0x04):
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
So I thought sed would be a good tool for removing all the non-printable junk in the file, but there seems to be some strangeness in how escape chars are represented in sed syntax. I have tried all of the following, none of which seem to work:
Include only specified chars:
sed 's/[^a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' IN.csv > OUT.csv
Identify range of non-printable in decimal or hex:
cat IN.csv | sed 's/[\d0-\d31]//g' > OUT.csv
cat IN.csv | sed s/[$'\x00'-$'\x1F']//g OUT.csv
cat IN.csv | sed 's/\x00-\x1F//g' > OUT.csv
and using Ctrl-V Ctrl-D to produce this:
cat IN.csv | sed s/^D//g > OUT.csv
All the commands appear to execute, but the resulting output file still contains the non-printable chars and appears to be changed in unexpected ways.
What I found that DOES WORK is this:
cat IN.csv | sed 's/'`echo -e "\x04"`'//g' > OUT.csv
or this:
cat IN.csv | sed 's/\x04//g' > test3.csv
However, this only works for a single escape char. Is there a better way to address all of the non-printable chars at once with a single range, without having to run one command per character? I assume I am not entering the range syntax properly.
For removal (and transliteration) there is a better tool called tr (translate or delete characters). You can remove non-printable characters using:
cat IN.csv | tr -cd '\11\12\15\40-\176' > OUT.csv
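-d deletes the characters in the given set, and -c complements that set first, so together they delete every byte that is not listed. The octal values \11, \12 and \15 are tab, line feed and carriage return; \40-\176 is the printable ASCII range from space to ~.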
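To see what the octal set keeps and drops, here is a minimal sketch on a tiny in-memory sample rather than the real file. The function name `strip_nonprintable` and the sample string are made up for illustration, and `LC_ALL=C` is an added assumption so that tr works byte by byte regardless of the locale.

```shell
# Hypothetical helper (name is illustrative, not from the answer above):
# keep tab (\11), LF (\12), CR (\15) and printable ASCII (\40-\176);
# delete every other byte read from stdin.
strip_nonprintable() {
    LC_ALL=C tr -cd '\11\12\15\40-\176'
}

# Sample line with an embedded 0x04 and 0x01; the tab and newline should survive.
printf 'a\004b,\001"c"\td\n' | strip_nonprintable
```

For the real file the same filter can read the input directly, e.g. `strip_nonprintable < IN.csv > OUT.csv`, which avoids the extra cat process.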
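Or using the POSIX [:print:] class. Tab, newline and carriage return are added explicitly here because [:print:] does not include them; with '[:print:]' alone, every line break in the CSV would be deleted as well:
cat IN.csv | tr -cd '[:print:]\t\n\r' > OUT.csv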
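Since newline, tab and carriage return are not members of `[:print:]`, it can be worth checking on a small sample whether line breaks survive a given set before running it over 2 GB. The sketch below compares the bare class with the class plus those three characters; both helper names and the sample rows are hypothetical, and `LC_ALL=C` is assumed so that `[:print:]` means exactly bytes 0x20-0x7E.

```shell
# Both helper names are hypothetical, used only for this comparison.
only_print()       { LC_ALL=C tr -cd '[:print:]'; }
print_and_breaks() { LC_ALL=C tr -cd '[:print:]\t\n\r'; }

# Two CSV rows, the first containing a stray 0x04.
sample() { printf 'r1,a\004\nr2,b\n'; }

sample | only_print; echo      # rows run together on one line
sample | print_and_breaks      # rows stay on separate lines
```

In both cases the 0x04 byte is removed; the difference is only in whether the row boundaries are preserved.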
You could try awk:
awk '{gsub(/[[:punct:]]/,"")}1' your_file
or try sed:
sed "s/[^a-z|0-9]//g;" orig_file > new_file
or try perl:
perl -pe 's/[^A-Za-z0-9\s]//g' orig_file > new_file
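As for expressing the whole range in a single sed command, which is what the question asks for, one option that avoids hex escapes altogether is the POSIX `[[:cntrl:]]` bracket class. This is only a sketch: it assumes GNU sed (other implementations may treat control bytes in the input differently) and forces the C locale so that the class means bytes 0x00-0x1F plus 0x7F. The function name `strip_cntrl` is hypothetical. Note that this also deletes tabs and carriage returns; line feeds are kept because sed, processing line by line, never sees them inside the pattern space.

```shell
# Hypothetical helper: delete every control character on each line.
# Assumes GNU sed; LC_ALL=C makes [[:cntrl:]] match 0x00-0x1F and 0x7F.
strip_cntrl() {
    LC_ALL=C sed 's/[[:cntrl:]]//g'
}

# Sample with 0x04, 0x01 and 0x1F embedded in one CSV row.
printf 'a\004b\001c,\037d\n' | strip_cntrl
```

If tabs need to be preserved, the tr approach with an explicit keep-list is probably the simpler choice, since it names exactly which bytes survive.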