
Grep, find lines with exact number of matching patterns

I want to find and list lines in a text file that contain only two words that are four characters or more.

I can find words of four characters or more with:

grep '[A-Za-z][A-Za-z][A-Za-z][A-Za-z][A-Za-z]*' file.txt

But how can I limit the output to show only lines with two such words?

Any hints (not necessarily an answer)?

Thanks

UPDATE: Thank you. After following your advice, I now have:

egrep '([A-Za-z]){4,}' file.txt

That lists all the lines containing words that are 4+ letters long, with those words highlighted. Now I only have to filter it to show the lines where such words (4+ letters long) occur twice. Any hints?

To look for two (or more) instances of PATTERN on a line, use:

PATTERN.*PATTERN

If you use grep -E you could use curly braces to avoid repetition:

grep -E '(.*PATTERN){2,}'

(You could also apply the same trick to avoid repeating [A-Za-z] in your pattern.)

You can use \< and \> to match the beginning and end of words to make sure 8-letter words aren't detected as two 4-letter words.
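Putting those hints together, here is a minimal sketch of one way to get from "at least two" to "exactly two": match lines with two or more such words, then filter out lines with three or more. It assumes GNU grep (\< and \> are extensions, not POSIX), and the sample lines and variable names are made up for illustration.

```shell
# Sample input, invented for this example.
sample=$(mktemp)
printf '%s\n' \
  'alpha beta' \
  'abcdefgh' \
  'one two cat' \
  'alpha beta gamma' \
  'the quick brown fox' > "$sample"

# A "word" here is a run of 4+ ASCII letters between word boundaries.
word='\<[A-Za-z]{4,}\>'

# Keep lines with at least two such words, then drop lines with three or more.
exactly_two=$(grep -E "(.*$word){2}" "$sample" | grep -vE "(.*$word){3}")

printf '%s\n' "$exactly_two"
rm -f "$sample"
```

With GNU grep this should keep 'alpha beta' and 'the quick brown fox'. The 8-letter line is not counted as two 4-letter words because of the word-boundary anchors.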

Just use awk so you don't have to come up with some convoluted regexp to do everything at once. With GNU awk for word boundaries and assuming your "words" only contain alphabetic characters as in your posted script:

awk 'gsub(/\<[[:alpha:]]{4,}\>/,"&") == 2'

The above is untested, of course, since you didn't provide sample input/output for us to test against.
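If GNU awk isn't available, a rough portable variant is to split each line on runs of non-letters and count the pieces that are four or more characters long. This is only a sketch under the same letters-only assumption. It is a different technique from the gsub() one-liner above, not a test of it, and the sample lines and variable names are made up.

```shell
# Sample input, invented for this example.
sample=$(mktemp)
printf '%s\n' \
  'alpha beta' \
  'abcdefgh' \
  'one two cat' \
  'alpha beta gamma' \
  'the quick brown fox' > "$sample"

awk_two=$(awk '
{
  n = 0
  # Break the line on runs of non-letters; w[] holds the letter-only pieces.
  m = split($0, w, /[^A-Za-z]+/)
  for (i = 1; i <= m; i++)
    if (length(w[i]) >= 4) n++
}
# Print the line only when exactly two long words were counted.
n == 2
' "$sample")

printf '%s\n' "$awk_two"
rm -f "$sample"
```

The loop keeps the counting logic explicit, so changing the minimum length or the required count is a one-character edit.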

EDIT: Here's the solution given on page 216 of the text you referenced in your comments, for exercise 7.5 on page 100, which you based your question on:

egrep '(\<[A-Za-z]{4,}\>).*\<\1\>' file

Let's first clean that up to remove the deprecated egrep and replace the character lists with a portable character class:

grep -E '(\<[[:alpha:]]{4,}\>).*\<\1\>' file

Now what you have is a script that, rather than looking for lines that contain only two words that are four characters or more as stated in your question, looks for lines that contain the same 4-or-more character word occurring at least two times, which is a very different and much simpler problem to solve.
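To make the difference concrete, here is a small sketch that runs the cleaned-up command on a few made-up lines. It assumes GNU grep, since both the word-boundary anchors and backreferences in -E mode are extensions beyond POSIX ERE. The sample lines and the variable name are illustrative only.

```shell
# Sample input, invented for this example.
sample=$(mktemp)
printf '%s\n' \
  'this and this' \
  'alpha beta' \
  'that thatch' \
  'word then word then word' > "$sample"

# \1 must repeat the exact word captured by the first group.
repeated=$(grep -E '(\<[[:alpha:]]{4,}\>).*\<\1\>' "$sample")

printf '%s\n' "$repeated"
rm -f "$sample"
```

Here 'alpha beta' is not printed even though it contains exactly two long words, because neither word repeats. 'word then word then word' is printed even though the repeated word occurs three times.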

1st: I recommend using \w (word character) for letters, it's cleaner. Note that \w also matches digits and the underscore, not only letters.
2nd: To group your pattern into a single token use (); to find multiple copies of a regex token use {}. (see Cheat sheet)
3rd: In this case your delimiter is whitespace, so I'd use \s, since I assume you might want to catch things like tabs. But that's at your own discretion.

Side note: I recommend avoiding * unless you have a strong delimiter (e.g. .* will greedily match to the end of your string).
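Before relying on \w and \s, it may help to check what they actually match. The sketch below assumes GNU grep, where both are supported as extensions, and uses made-up tokens and variable names. In particular, \w accepts digits and underscores as well as letters, while [[:alpha:]] accepts letters only.

```shell
# Count the tokens made of exactly four \w characters.
w_hits=$(printf '%s\n' 'abcd' 'ab_1' '1234' 'ab-d' | grep -cE '^\w{4}$')

# Count the tokens made of exactly four letters.
alpha_hits=$(printf '%s\n' 'abcd' 'ab_1' '1234' 'ab-d' | grep -cE '^[[:alpha:]]{4}$')

# \s matches a tab as well as a space between the two words.
tab_hits=$(printf 'alpha\tbeta\nalpha beta\nalphabeta\n' | grep -cE 'alpha\sbeta')

echo "$w_hits $alpha_hits $tab_hits"
```

If the words must be purely alphabetic, as in the question, [[:alpha:]] or [A-Za-z] is the safer choice.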

Cheat sheet: https://www.rexegg.com/regex-quickstart.html
