Extract the first line from each group of consecutive matching lines

Question

I have a data file that looks like this:

a separator
interesting line 1
interesting line 2
a comment
interesting line 3
interesting line 4
interesting line 5
a non interesting line
some other data
interesting line 6
.
.
.

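I want to extract the first interesting line from each consecutive group, no matter how many lines a group contains or how many other lines separate the groups of interesting lines.

For the test input above, the output would be: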

interesting line 1
interesting line 3
interesting line 6

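I can do this easily in Python with a state variable that is set when a line matches and reset when I hit a non-matching line, but what about a one-line shell script? Is there a not-too-obscure way to do this?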
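
For reference, the state-variable logic described above could be written out as a plain bash loop. This is only a sketch of that logic, not the one-liner being asked for; the function name first_of_each_group and the substring test are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the state-variable approach: print a matching line only when the
# previous line did not match. first_of_each_group is a hypothetical name.
first_of_each_group() {
  local pattern=$1 in_group=0 line
  # "|| [[ -n $line ]]" also handles a last line without a trailing newline.
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line == *"$pattern"* ]]; then
      # Print only on the transition from "outside a group" to "inside a group".
      (( in_group )) || printf '%s\n' "$line"
      in_group=1
    else
      in_group=0
    fi
  done
}

# Usage: first_of_each_group 'interesting line' < file
```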

Answer 1

You can use grep with a greedy regex, then print the first line of each match with the following command:

grep -Pzo '([^\n]*interesting line[^\n]*(\n|$))+' file |
  while IFS='' read -d '' -r match
  do
    head -n1 <<< "$match"
  done

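grep arguments:

-P: use a Perl-compatible regular expression (instead of the default basic regular expression); this is needed for the \n in the regex.
-z: treat the input as a set of lines, each terminated by a zero byte instead of a newline. A file without NUL bytes is therefore searched as a single record, so the regex can span several lines.
-o: print only the matched parts. Combined with -z, each match is followed by an ASCII NUL character, which lets us separate the matches reliably.
The regex ([^\n]*interesting line[^\n]*(\n|$))+ matches each group of consecutive lines containing interesting line.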
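
To see what grep hands to the loop before head trims it, the sketch below prints every whole match on one line. It assumes GNU grep built with PCRE support; show_groups is a hypothetical helper name, not part of the command above. Note that the regex is a substring match, so "a non interesting line" in the sample joins the group that starts at interesting line 3; the final output is unaffected here only because "some other data" still separates that group from interesting line 6.

```shell
#!/usr/bin/env bash
# Sketch only: show_groups is a hypothetical helper, and it assumes GNU grep
# built with PCRE support (-P).
show_groups() {
  local nl=$'\n' group
  grep -Pzo '([^\n]*interesting line[^\n]*(\n|$))+' "$1" |
    while IFS='' read -r -d '' group; do
      group=${group%"$nl"}                    # drop the trailing newline
      printf '[%s]\n' "${group//"$nl"/ | }"   # join the lines of the group
    done
}

# Usage: show_groups file
```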

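In the while condition, IFS is emptied for the read command. With the default IFS, read would strip the trailing newline of each match (probably not a problem here) as well as any leading whitespace. Clearing IFS in a "while read" loop is good practice whenever the text must reach the variable unaltered.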
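
The difference is easy to check on a single record. The following is a small sketch with made-up data; the helper record and both variable names are purely illustrative.

```shell
#!/usr/bin/env bash
# A made-up record: leading spaces, a trailing newline, terminated by NUL.
record() { printf '  indented line\n\0'; }

# Default IFS: leading and trailing whitespace (newline included) is stripped.
read -r -d '' with_default_ifs < <(record)

# Empty IFS: the record is stored exactly as it was read.
IFS='' read -r -d '' with_empty_ifs < <(record)

# Brackets make the preserved or stripped whitespace visible.
printf 'default IFS: [%s]\n' "$with_default_ifs"
printf 'empty IFS:   [%s]\n' "$with_empty_ifs"
```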

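read arguments:

-d '': use the empty string as the delimiter (= the ASCII NUL character). This is equivalent to -d $'\0' (see https://unix.stackexchange.com/q/61029/283498).
-r: do not interpret backslashes in the input (see https://unix.stackexchange.com/q/192786/283498).
match: just a variable name I chose, used in the loop body.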
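
As a quick illustration of these two options, here is a sketch that reads NUL-terminated records from stdin and reports the length of each one. The name print_records and the sample data are assumptions. A record may contain newlines, and with -r a backslash is kept as an ordinary character.

```shell
#!/usr/bin/env bash
# Sketch only: print_records is a hypothetical helper.
print_records() {
  local i=0 record
  # -d '' stops each read at a NUL byte; -r keeps backslashes untouched.
  while IFS='' read -r -d '' record; do
    i=$((i + 1))
    printf 'record %d has %d characters\n' "$i" "${#record}"
  done
}

# Two records: the first spans two lines, the second contains a backslash.
printf 'one\ntwo\0a\\b\0' | print_records
```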

In the loop body, head -n1 <<< "$match" prints only the first line of the current match (head -n 1 prints the first line of its input). Side note: <<< is a bashism; the command is equivalent to echo "$match" | head -n1.
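
The equivalence can be checked with a made-up value for match, as in the sketch below. The third form, a parameter expansion that avoids starting head at all, is my own variation for comparison and is not part of the command above.

```shell
#!/usr/bin/env bash
# A made-up match holding two lines, as grep -Pzo would produce it.
match=$'interesting line 1\ninteresting line 2\n'

from_here_string=$(head -n1 <<< "$match")   # the form used in the loop
from_pipe=$(echo "$match" | head -n1)       # the equivalent pipe form
from_expansion=${match%%$'\n'*}             # variation: cut at the first newline

printf '%s\n' "$from_here_string" "$from_pipe" "$from_expansion"
```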

Solution 1
1 2021-10-12 19:52:21