How to make "grep -zoP" display every match separately?

Question

I have a file in this form:

X/this is the first match/blabla
X-this is
the second match-

and here we have some fluff.

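I want to extract everything that occurs after an "X" and between the same markers. So if I have "X+match+", I want to get "match", because it appears after "X" and between the marker "+".

So for the given sample file I would like to get this output: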

this is the first match

and

this is
the second match

I managed to get all the content between X and the marker by using:

grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file

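That is:

grep -Po '(?<=X(.))(.|\n)+(?=\1)' matches X followed by (something), which is captured and then matched again at the end by (?=\1) (I based the code on an earlier answer of mine).
Note that I use (.|\n) to match anything, including a newline, and that I also use -z in grep so that the match can span newlines.

So this works well; the only problem is how the output is displayed: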

$ grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file
this is the first matchthis is
the second match

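As you can see, all the matches appear together: "this is the first match" is followed directly by "this is the second match", with no visible separator at all. I know this comes from the use of "-z", which treats the whole file as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline (quoting "man grep").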
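To make the invisible separators visible, here is a small sketch that recreates the sample file and counts the NUL bytes in grep's output. It assumes GNU grep built with -P support; the temporary file and the variable names are placeholders chosen for illustration.

```shell
# Recreate the sample file from the question in a temporary location.
sample=$(mktemp)
printf 'X/this is the first match/blabla\nX-this is\nthe second match-\n\nand here we have some fluff.\n' > "$sample"

# Delete every byte except NUL, then count what is left:
# with -z, each match printed by -o is terminated by one NUL.
nul_count=$(grep -zPo '(?<=X(.))(.|\n)+(?=\1)' "$sample" | tr -cd '\0' | wc -c | tr -d ' ')
echo "NUL-terminated matches: $nul_count"

rm -f "$sample"
```

For the sample file this should report 2, which suggests the matches are in fact delimited; the terminal simply does not render the delimiter.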

So: is there a way to get all these results separately?

I also tried in GNU Awk:

awk 'match($0, /X(.)(\n|.*)\1/, a) {print a[1]}' file

but not even the (\n|.*) worked.

Answer 1

awk does not support backreferences within a regexp definition.

Workarounds:

$ grep -zPo '(?s)(?<=X(.)).+(?=\1)' ip.txt | tr '\0' '\n'
this is the first match
this is
the second match

# with ripgrep, which supports multiline matching
$ rg -NoUP '(?s)(?<=X(.)).+(?=\1)' ip.txt
this is the first match
this is
the second match

You can also use (?s)X(.)\K.+(?=\1) instead of (?s)(?<=X(.)).+(?=\1). Also, you might want to use a non-greedy quantifier here, to avoid matching match+xyz+foobaz for the input X+match+xyz+foobaz+.
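The effect of the quantifier choice on the input X+match+xyz+foobaz+ can be checked with a short sketch. It assumes GNU grep with -P support; the variable names are only for illustration.

```shell
input='X+match+xyz+foobaz+'

# Greedy: .+ runs as far as it can, so the match ends before the LAST "+".
greedy=$(printf '%s\n' "$input" | grep -oP 'X(.)\K.+(?=\1)')

# Lazy: .+? stops as soon as the lookahead succeeds, before the NEXT "+".
lazy=$(printf '%s\n' "$input" | grep -oP 'X(.)\K.+?(?=\1)')

printf 'greedy: %s\n' "$greedy"
printf 'lazy:   %s\n' "$lazy"
```

The greedy form should yield match+xyz+foobaz and the lazy form match. Which one is right depends on whether the marker character can legitimately occur inside the text you want to extract.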

With perl:

$ perl -0777 -nE 'say $& while(/X(.)\K.+(?=\1)/sg)' ip.txt
this is the first match
this is
the second match

Answer 2

Here is another gnu-awk solution that makes use of RS and RT:

awk -v RS='X.' 'ch != "" && n=index($0, ch) {
   print substr($0, 1, n-1)
}
RT {
   ch = substr(RT, 2, 1)
}' file

this is the first match
this is
the second match

Answer 3

Using GNU awk for multi-char RS, RT, and gensub(), without having to read the whole file into memory:

$ awk -v RS='X.' 'NR>1{print "<" gensub(end".*","",1) ">"} {end=substr(RT,2,1)}' file
<this is the first match>
<this is
the second match>

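Obviously I added the "<" and ">" so you can see where each output record starts and ends.

The above assumes that the character after X is never a non-repetition regexp metachar (e.g. ., ^, [, etc.), so YMMV.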

Answer 4

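The use case is somewhat problematic, because as soon as you print the matches, you lose the information about exactly where the separator was. But if that is acceptable, try piping to xargs -r0.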

grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file | xargs -r0

These options are GNU extensions, but then so are grep -z and (mostly) grep -P, so perhaps that is acceptable.
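If you want each record on its own line with visible boundaries, one possible variation is to hand the NUL-separated records to printf through xargs. This is a sketch and not part of the command above: the two sample records are made up to stand in for grep's output, and the angle brackets are just markers.

```shell
# Two NUL-terminated records stand in for the output of grep -zPo;
# the second one contains an embedded newline.
records=$(printf 'first match\0second\nmatch\0' | xargs -r0 printf '<%s>\n')

# printf reuses its format for every argument xargs passes in,
# so each record is wrapped in <...> and ends with a newline.
printf '%s\n' "$records"
```

Because the format string is applied once per argument, a record that spans several lines stays together between one pair of markers.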

Answer 5

GNU grep -z terminates input/output records with null characters (useful in conjunction with other tools such as sort -z). pcregrep does not do that:

pcregrep -Mo2 '(?s)X(.)(.+?)\1' file

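-o number (here -o2, which prints only capture group 2) is used instead of lookarounds. The lazy quantifier ? was added (in case \1 occurs again later).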

Answer 1: 5, 2020-11-23 13:16:23
Answer 2: 4, 2020-11-23 15:21:40
Answer 3: 3, 2020-11-23 14:51:37
Answer 4: 2, accepted, 2020-11-23 13:23:11
Answer 5: 1, 2020-11-24 03:09:35
