
Bash grep regex issue with two different files

I have the following command, which filters three-letter words from a file made up of uppercase words only, one word per line:

grep -E '^[A-Z]{3}$' test

The command returns the correct list of words when used with a file test containing 10 words. When applied to a much bigger file dico.txt containing over 30,000 words, it returns nothing (a new prompt is simply displayed).

As I thought it might be either an extension or a file size issue, I've tried:

  • cp test test.txt, to match the big file's *.txt extension
  • Creating a new file dico_small.txt by selecting 1000 lines from dico.txt

...both without success.

Your large file has Windows line endings, that is, \r\n instead of the Linux line ending \n.

\r is called a carriage return and is treated as a normal character by grep. When you write grep -E "a$" fileWithWindowsLineEndings, grep won't find anything, because in front of the Linux line ending \n (the position $ matches in grep) there is always a \r and never an a.
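To see the effect for yourself, here is a minimal sketch that reproduces it. The file names and the three sample words are made up for illustration; only the grep pattern comes from the question. It assumes a Unix-like system with mktemp available.

```shell
# Build two hypothetical sample files: one with Linux (\n) line endings
# and one with Windows (\r\n) line endings.
tmpdir=$(mktemp -d)
printf 'CAT\nHOUSE\nDOG\n' > "$tmpdir/unix.txt"
printf 'CAT\r\nHOUSE\r\nDOG\r\n' > "$tmpdir/windows.txt"

# Count matching lines. "|| true" keeps the script going when grep
# exits with status 1 because nothing matched.
unix_hits=$(grep -cE '^[A-Z]{3}$' "$tmpdir/unix.txt" || true)
windows_hits=$(grep -cE '^[A-Z]{3}$' "$tmpdir/windows.txt" || true)
echo "unix: $unix_hits, windows: $windows_hits"

# Count the lines that contain a carriage return to confirm the diagnosis.
cr=$(printf '\r')
cr_lines=$(grep -c "$cr" "$tmpdir/windows.txt" || true)
echo "lines with a carriage return: $cr_lines"

rm -r "$tmpdir"
```

If the last count is greater than zero for your own file, the carriage returns are very likely what stops the pattern from matching.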

You can convert your file to a normal Linux file by deleting all \r characters.

tr -d '\r' < fileWithWindowsLineEndings > fileWithLinuxLineEndings
grep -E '...' fileWithLinuxLineEndings

Alternatively, convert the file on the fly without saving the conversion result:

tr -d '\r' < fileWithWindowsLineEndings | grep -E '...'
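A different approach, not part of the commands above, is to leave the file alone and let the pattern tolerate an optional carriage return before the end of the line. The sketch below assumes a made-up sample file and a Unix-like system with mktemp; the matched lines still carry their \r, so it is stripped from the output afterwards.

```shell
# Hypothetical sample file with Windows line endings.
sample=$(mktemp)
printf 'CAT\r\nHOUSE\r\nDOG\r\n' > "$sample"

# Hold a literal carriage return in a variable (works in any POSIX shell).
cr=$(printf '\r')

# Allow an optional carriage return before the end of the line,
# then remove it from the matched lines so the output is clean.
matches=$(grep -E "^[A-Z]{3}${cr}?\$" "$sample" | tr -d '\r')
echo "$matches"

rm "$sample"
```

This can be handy when you cannot modify or copy the file, but deleting the \r characters first keeps the pattern simpler.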
