
Searching for non-ASCII characters

I have a file, a.out, which contains a number of lines. Each line is one character only, either the Unicode character U+2013 or a lowercase letter a-z.

Running the file command on a.out gives the result UTF-8 Unicode text.

The locale command reports:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

If I issue the command grep -P -n "[^\x00-\xFF]" a.out I would expect only the lines containing U+2013 to be returned, and this is the case if I carry out the test under Cygwin. The problem environment, however, is Oracle Linux Server release 6.5, and the issue is that the grep command returns no lines. If I issue grep -P -n "[\x00-\xFF]" a.out then all lines are returned.

I realise that "[grep -P] ...is highly experimental and grep -P may warn of unimplemented features." but no warnings are issued.

Am I missing something?

I recommend avoiding dodgy grep -P implementations and using the real thing. This works:

perl -CSD -nle 'print "$.: $_" if /\P{ASCII}/' utfile1 utfile2 utfile3 ...

Where:

  • The -CSD option says that both the stdio trio (stdin, stdout, stderr) and disk files should be treated as UTF-8 encoded.

  • The $. represents the current record (line) number.

  • The $_ represents the current line.

  • The \P{ASCII} matches any code point that is not ASCII.
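
For comparison, the same filter can be sketched in Python. This is only an illustrative equivalent and not part of the answer above: the function name non_ascii_lines is hypothetical, and the input is assumed to be text that has already been decoded from UTF-8.

```python
import re

# Matches any code point outside the ASCII range U+0000..U+007F,
# which is roughly what \P{ASCII} selects in the Perl one-liner.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def non_ascii_lines(lines):
    """Yield (line_number, line) for each line holding a non-ASCII character.

    `lines` is any iterable of decoded str lines (hypothetical helper).
    """
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if NON_ASCII.search(text):
            yield number, text


# Small in-memory demonstration: one en dash among plain letters.
for number, text in non_ascii_lines(["a\n", "\u2013\n", "b\n"]):
    print(f"{number}: {text}")
```

To run it over real files, pass an open file object, for example open(path, encoding="utf-8"). Note that this sketch restarts numbering for each iterable, whereas $. under -n keeps counting across the files named on the command line.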

A comment in How Do I grep For all non-ASCII Characters in UNIX gives the answer:

"Grep (and family) don't do Unicode processing to merge multi-byte characters into a single entity for regex matching as you seem to want."

That implies that the UTF-8 encoding for U+2013 (0xe2, 0x80, 0x93) is not treated by grep as parts of a single printable character outside the given range.
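
The byte-versus-character distinction can be made concrete with a small Python sketch. Python is used here purely for illustration and says nothing about how grep is implemented. It shows that each of the three UTF-8 bytes of U+2013 lies inside \x00-\xFF, so a matcher that looks at bytes can never find anything outside that range, while a matcher that looks at decoded code points can.

```python
import re

en_dash = "\u2013"                  # the character from the question
encoded = en_dash.encode("utf-8")   # its three-byte UTF-8 form

# Every individual byte is at most 0xFF by definition.
print([hex(b) for b in encoded])    # ['0xe2', '0x80', '0x93']

# Byte-oriented matching: no byte lies outside \x00-\xFF, so nothing matches.
byte_match = re.search(rb"[^\x00-\xFF]", encoded)

# Character-oriented matching: U+2013 is above U+00FF, so it matches.
char_match = re.search(r"[^\x00-\xFF]", en_dash)

print(byte_match is None)           # True
print(char_match is not None)       # True
```

This mirrors the symptoms in the question: the negated class returns nothing and the positive class returns every line when the input is examined one byte at a time.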

The GNU grep manual's description of -P does not mention Unicode or UTF-8. Rather, it says "Interpret the pattern as a Perl regular expression." (This does not mean that the result is identical to Perl, only that some of the backslash-escapes are similar.)

Perl itself can be told to use UTF-8 encoding. However, the examples using Perl in Filtering invalid utf8 do not use that feature. Instead, the expressions (like those in the problematic grep) test only the individual bytes, not the complete character.

gawk can help you with this problem.

Here is the awk one-liner:

 awk -v FS="" 'BEGIN{for(i=1;i<128;i++)ord[sprintf("%c",i)]=i}
               {for(i=1;i<=NF;i++)if(!($i in ord))print $i}' file

Below is a test with gawk:

kent$  cat f
abcd
+ß
s+äö
ö--我
中文

kent$  awk -v FS="" 'BEGIN{for(i=1;i<128;i++)ord[sprintf("%c",i)]=i}{for(i=1;i<=NF;i++)if(!($i in ord))print $i}' f
ß
ä
ö
ö
我
中
文
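
As a cross-check of the idea behind the awk one-liner (keep every character whose code point is not in the 1 to 127 table), here is a minimal Python sketch run on the same sample input. The helper name non_ascii_chars is hypothetical, and the lines are assumed to be decoded text.

```python
def non_ascii_chars(lines):
    """Return every character with a code point above 127, in input order."""
    return [ch for line in lines for ch in line if ord(ch) > 127]


# Same sample as the file f used in the gawk test.
sample = ["abcd", "+ß", "s+äö", "ö--我", "中文"]
for ch in non_ascii_chars(sample):
    print(ch)
```

Like the gawk version in a UTF-8 locale, this works on whole characters rather than bytes, so multi-byte characters such as 我 are reported once each.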

