grep patterns from a file, print the pattern instead of the matched string

Question

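I want to grep using patterns from a file that contains regular expressions. When a pattern matches, grep prints the matched string but not the pattern. How can I get the pattern instead of the matched string?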

pattern.txt

Apple (Ball|chocolate|fall) Donut
donut (apple|ball) Chocolate
Donut Gorilla Chocolate
Chocolate (English|Fall) apple gorilla
gorilla chocolate (apple|ball)
(ball|donut) apple

strings.txt

apple ball Donut
donut ball chocolate
donut Ball Chocolate
apple donut
chocolate ball Apple

This is the grep command:

grep -Eix -f pattern.txt strings.txt

This command prints the matching strings from strings.txt:

apple ball Donut
donut ball chocolate
donut Ball Chocolate

But I want to find the patterns from pattern.txt that were used for the match:

Apple (Ball|chocolate|fall) Donut
donut (apple|ball) Chocolate

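pattern.txt can contain lowercase and uppercase letters, lines with and without regex, and any number of words and regex elements. The only regex constructs used are parentheses and pipes.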

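I don't want to use a loop that reads pattern.txt and runs grep once per line, because that is slow. Is there a way to make grep print which pattern, or which line number of the pattern file, matched? Or is there a command other than grep that can do the job without being too slow?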

Answer 1

I don't know of a way with grep, but with GNU awk:

$ awk '
BEGIN { IGNORECASE = 1 }      # for case insensitivity
NR==FNR {                     # process pattern file
    a[$0]                     # hash the entries to a
    next                      # process next line
}
{                             # process strings file
    for(i in a)               # loop all pattern file entries
        if($0 ~ "^" i "$") {  # if the whole line matches (anchored, like grep -x)
            print i           # output the matching pattern file entry
            # delete a[i]     # uncomment to delete matched patterns from a
            # next            # uncomment to end searching after first match
        }
}' pattern strings

Output:

D (A|B) C

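For each line in strings, the script loops over every line of pattern to see whether more than one pattern matches. With case-sensitive matching there is only one match; GNU awk's IGNORECASE, set in the BEGIN block above, is one way around that.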

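Also, if you want each matching pattern file entry to be output only once, you can delete it from a after its first match: add delete a[i] after the print. That may also give you a small performance gain.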
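
The question also asks for the line number of the matching pattern. As a minimal sketch (not part of the original answer), the same loop can keep the patterns in an array indexed by line number and lowercase both sides with tolower(), so it does not depend on GNU awk's IGNORECASE. The function name matched_patterns is hypothetical, and lowercasing the pattern text is only safe under the question's stated assumption that patterns contain nothing but words, spaces, parentheses and pipes.

```shell
#!/bin/sh
# Sketch: print "<line number>: <pattern>" once for every pattern that matches
# at least one whole line of the strings file, ignoring case.
# matched_patterns is a hypothetical helper name, not an existing tool.
matched_patterns() {
    # $1 = pattern file, $2 = strings file
    awk '
    NR == FNR {                            # first file: the patterns
        pat[FNR] = $0                      # original text, kept for printing
        re[FNR]  = "^(" tolower($0) ")$"   # lowercased and anchored like grep -x
        n = FNR
        next
    }
    {                                      # second file: the strings
        line = tolower($0)
        for (i = 1; i <= n; i++)
            if ((i in re) && line ~ re[i]) {
                hit[i] = 1
                delete re[i]               # stop testing a pattern once it matched
            }
    }
    END {
        for (i = 1; i <= n; i++)           # report in pattern-file order
            if (i in hit) print i ": " pat[i]
    }' "$1" "$2"
}
```

With the sample files from the question, matched_patterns pattern.txt strings.txt should print the first two patterns, prefixed with 1: and 2:. Both files are still read only once, but every string is tested against every pattern that has not matched yet, so the cost grows with the product of the two file sizes.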

Answer 2

EDIT: Since the OP changed the Input_file, I am adding a solution for the changed Input_file as well.

awk '
FNR==NR{
   a[toupper($1),toupper($NF)]
   b[toupper($2)]
   next
}
{
   flag=""                 # reset per line so a stale match cannot leak into the next line
   val=toupper($2)
   gsub(/\)|\(|\|/," ",val)
   num=split(val,array," ")
   for(i=1;i<=num;i++){
      if(array[i] in b){
        flag=1
        break
      }
   }
}
flag && ((toupper($1),toupper($NF)) in a){
  print;
  flag=""
}' string pattern

Output will be as follows.

Apple (Ball|chocolate|fall) Donut
donut (apple|ball) Chocolate

Solution 1: A more generic solution, assuming the Input_file named pattern can have more than two values in its second field, e.g. (B|C|D|E). In that case the following may help.

awk '
FNR==NR{
   a[$1,$NF]
   b[toupper($2)]
   next
}
{
   val=$2
   gsub(/\)|\(|\|/," ",val)
   num=split(val,array," ")
   for(i=1;i<=num;i++){
      if(array[i] in b){
        flag=1
        break
      }
   }
}
flag && (($1,$NF) in a)
{
  flag=""
}' string pattern

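Solution 2: Please try the following. It strictly assumes that your Input_file(s) follow the same layout as the samples shown (here I assume the Input_file named pattern has only 2 values in its second field).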

awk '
FNR==NR{
  a[$1,$NF]
  b[toupper($2)]
  next
}
{
  val=$2
  gsub(/\)|\(|\|/," ",val)
  split(val,array," ")
}
((array[1] in b) || (array[2] in b)) && (($1,$NF) in a)
' string pattern

Output will be as follows.

A (B|C) D
D (A|B) C

Answer 3

You could try it with bash builtins only:

$ cat foo.sh
#!/usr/bin/env bash

# case insensitive
shopt -s nocasematch

# associative array of patterns
declare -A patterns=()
while read -r p; do
    patterns["$p"]=1
done < pattern.txt

# read strings, test remaining patterns,
# if match print pattern and remove it from array    
while read -r s; do
    for p in "${!patterns[@]}"; do
        if [[ $s =~ ^$p$ ]]; then
            printf "%s\n" "$p"
            unset patterns["$p"]
        fi
    done
done < strings.txt
$ ./foo.sh
Apple (Ball|chocolate|fall) Donut
donut (apple|ball) Chocolate

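I am not sure about the performance, but since no child processes are spawned, it should be much faster than invoking grep once per pattern.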

Of course, if you have millions of patterns, storing them in an associative array could exhaust the available memory.

Answer 4

Maybe switch paradigms?

while IFS= read -r pat
do grep -Eix "$pat" strings.txt >"$pat" &
done <patterns.txt

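That will make for ugly filenames, but you will have a clear list of matches for each pattern. You could clean up the filenames first if you like. Maybe (assuming the patterns reduce easily to something unique...)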

while IFS= read -r pat
do grep -Eix "$pat" strings.txt >"${pat//[^A-Z]/}" &
done <patterns.txt

It ought to be reasonably fast, and it is relatively simple to implement. Hope that helps.
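
If cleaning up the filenames does not yield unique names, one alternative is to name each output file after the pattern's line number. The following is only a sketch of that variation: the function name split_by_pattern and the match_N.txt naming are hypothetical, and it still starts one grep per pattern, as the loops above do.

```shell
#!/bin/sh
# Sketch: one background grep per pattern, writing the matches of pattern N
# to <outdir>/match_N.txt. split_by_pattern is a hypothetical helper name.
split_by_pattern() {
    # $1 = pattern file, $2 = strings file, $3 = existing output directory
    n=0
    while IFS= read -r pat; do
        n=$((n + 1))
        # -e keeps a pattern that starts with "-" from being read as an option
        grep -Eix -e "$pat" "$2" > "$3/match_$n.txt" &
    done < "$1"
    wait                                 # let every background grep finish
    # the redirection creates a file even when nothing matched; drop empty ones
    for f in "$3"/match_*.txt; do
        [ -s "$f" ] || rm -f "$f"
    done
}
```

After a call such as split_by_pattern pattern.txt strings.txt out (with out created beforehand), the files that remain identify the matching patterns by line number, and each file lists the strings that pattern matched.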
