Recursively list files containing m of n regular expressions

Question

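I have a directory containing a lot of files. I have n search patterns and want to list all files that match m of them.

Example: from the files below, list those that contain at least two of str1, str2, str3 and str4.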

$ ls -l dir/
total 16
-rw-r--r--. 1 me me 10 Jun 22 14:22 a
-rw-r--r--. 1 me me  5 Jun 22 14:22 b
-rw-r--r--. 1 me me 10 Jun 22 14:22 c
-rw-r--r--. 1 me me  9 Jun 22 14:22 d
-rw-r--r--. 1 me me 10 Jun 22 14:22 e
$ cat dir/a
str1
str2
$ cat dir/b
str2
$ cat dir/c
str2
str3
$ cat dir/d
str
str4
$ cat dir/e
str2
str4

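I managed to do this with a rather ugly for loop over the output of find that spawns n grep processes per file, which is obviously very inefficient and takes a long time on directories with many files: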

for f in $(find dir/ -type f); do
  c=0
  grep -qs 'str1' $f && let c++
  grep -qs 'str2' $f && let c++
  grep -qs 'str3' $f && let c++
  grep -qs 'str4' $f && let c++
  [[ $c -ge 2 ]] && echo $f
done

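I'm fairly sure there is a better way to do this, but I don't know how to approach it. From what I can tell from the man page (i.e. -e and -m), it isn't possible with grep alone.

What is the right tool? Is this possible with awk?

Bonus: by using find I can define the files to search more precisely (i.e. -prune certain subdirectories, or only search files matching -iname '*.txt'), and I would like to be able to do that with other solutions too.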
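For solutions that do not go through find, the same kind of selection can be approximated in the program itself. The following is only a sketch in Python: the function name `iter_files` and its parameters `prune` and `name_glob` are made up for illustration, and it skips directories by bare name rather than by full path as find's -prune expressions can.

```python
import fnmatch
import os


def iter_files(top, prune=(), name_glob="*"):
    """Yield paths of files under `top`, skipping directories named in `prune`.

    `prune` and `name_glob` are hypothetical parameters for this sketch.
    """
    for root, dirs, files in os.walk(top):
        # Editing `dirs` in place stops os.walk from descending into them.
        dirs[:] = [d for d in dirs if d not in prune]
        for name in files:
            # Lower-casing both sides approximates find's -iname.
            if fnmatch.fnmatchcase(name.lower(), name_glob.lower()):
                yield os.path.join(root, name)
```

Something like `iter_files("dir", prune={".git"}, name_glob="*.txt")` would then supply the paths to whatever does the matching.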

Update

Below are some statistics on the performance of the different implementations.

`find` + `awk`

(script from this answer)

real    0m0,006s
user    0m0,002s
sys     0m0,004s

`python`

(I'm new to Python, please advise if this can be optimized):

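import os

patterns = ["str1", "str2", "str3", "str4"]

for root, dirs, files in os.walk("dir"):
    for file in files:
        filepath = os.path.join(root, file)
        remaining = set(patterns)  # patterns not yet seen in this file
        c = 0
        with open(filepath, 'r') as f:
            # read the file once; looping over the patterns outside this loop
            # would exhaust the file handle on the first pattern that misses
            for line in f:
                found = {p for p in remaining if p in line}
                c += len(found)      # each pattern counts at most once
                remaining -= found
                if c >= 2:
                    break
        if c >= 2:
            print(filepath)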

real    0m0,025s
user    0m0,019s
sys     0m0,006s

`c++`

(script from this answer)

real    0m0,002s
user    0m0,001s
sys     0m0,001s

Answer 1

$ cat reg.txt
str1
str2
str3
str4

$ cat prog.awk
# reads regexps from the first input file
# parameterized by `m'
# requires gawk or mawk for `nextfile'
FNR == NR {
  reg[NR] = $0
  next
}
FNR == 1 {
  for (i in reg)
    tst[i]
  cnt = 0
}
{
  for (i in tst) {
    if ($0 ~ reg[i]) {
      if (++cnt == m) {
        print FILENAME
        nextfile
      }
      delete tst[i]
    }
  }
}

$ find dir -type f -exec awk -v m=2 -f prog.awk reg.txt {} +
dir/a
dir/c
dir/e
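For comparison with the Python attempt in the question, the idea used by prog.awk (drop a regexp from the candidate set once it has matched, and stop reading the file once m have matched) can be sketched in Python roughly as follows. This is not part of the awk solution; the function names are invented here, and m is assumed to be at least 1.

```python
import re


def matches_m_of_n(lines, regexes, m):
    """Return True once m distinct compiled regexes have each matched a line.

    Assumes m >= 1.
    """
    pending = list(regexes)  # regexes that have not matched yet
    found = 0
    for line in lines:
        still_pending = []
        for rx in pending:
            if rx.search(line):
                found += 1  # each regex is counted at most once
                if found >= m:
                    return True  # stop reading, like `nextfile`
            else:
                still_pending.append(rx)
        pending = still_pending
    return False


def list_matching_files(paths, patterns, m):
    """Yield every path whose contents match at least m of the patterns."""
    regexes = [re.compile(p) for p in patterns]
    for path in paths:
        # errors="replace" is an assumption so undecodable bytes do not abort the scan
        with open(path, errors="replace") as fh:
            if matches_m_of_n(fh, regexes, m):
                yield path
```

The paths could come from any source, for example the lines of `find dir -type f` or an os.walk loop.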

Answer 2

Since the programming language matters less than performance, here's a C++ version. I haven't compared it against awk myself, though.

#include <cstddef>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

int main() {
    const fs::path dir = "dir";
    std::vector<std::string_view> strs{   // or: std::array<std::string_view, 4>
        "str1",
        "str2",
        "str3",
        "str4",
    };

    std::string line;
    int count;     // matches in a file
    size_t strsco; // number of strings to check in strs

    // a lambda to find a match on a line
    auto matcher = [&](const fs::directory_entry& de) {
        for(size_t idx = 0; idx < strsco;) {
            if(line.find(strs[idx]) != std::string::npos) {
                // a match was found

                if(++count >= 2) {
                    std::cout << de.path() << '\n';
                    // or the below if the quotation marks surrounding the path are
                    // unwanted:
                    // std::cout << de.path().native() << '\n';
                    return false;
                }

                // swap the found string_view with the last in the vector
                // to remove it from future matches in this file.
                --strsco;
                std::swap(strs[idx], strs[strsco]);
                // do not advance idx: it now holds a string that has not been
                // tested against this line yet
            } else {
                ++idx;
            }
        }
        return true;
    };

    // do a "find dir -type f"
    for(const fs::directory_entry& de : fs::recursive_directory_iterator(dir)) {
        if(de.is_regular_file()) { // -type f

            // open the found file
            if(std::ifstream file(de.path()); file) {
                // reset counters
                count = 0;
                strsco = strs.size();
                // read line by line until the file stream is depleted or matcher()
                // returns false
                while(std::getline(file, line) && matcher(de));
            }
        }
    }
}

Save it to prog.cpp and compile it like this (if you have g++):

g++ -std=c++17 -O3 -o prog prog.cpp

If you use a different compiler, be sure to turn on optimization for speed. The program requires C++17.

Answer 3

Here's an option using awk, since you tagged the question with it too:

find dir -type f -exec \
awk '/str1|str2|str3|str4/{c++} END{if(c>=2) print FILENAME;}' {} \;

However, it counts duplicates, so a file containing

str1
str1

would be listed.
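To make the caveat concrete, here is a small Python sketch (not a translation of the one-liner; both function names are made up) that contrasts counting matching lines, which is what the awk command does, with counting distinct patterns, which is what the question asks for.

```python
import re

PATTERNS = ["str1", "str2", "str3", "str4"]


def count_matching_lines(lines, patterns=PATTERNS):
    """Number of lines matching any pattern (what the one-liner counts)."""
    any_rx = re.compile("|".join(patterns))
    return sum(1 for line in lines if any_rx.search(line))


def count_distinct_patterns(lines, patterns=PATTERNS):
    """Number of different patterns that match at least one line."""
    seen = set()
    for line in lines:
        seen.update(p for p in patterns if re.search(p, line))
    return len(seen)
```

For the two-line file above, the first function returns 2 and the second returns 1, so only the second would keep that file out of a "2 of 4" listing.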
