Parsing multiple lines with awk

Question

I have multi-line output that looks like this:

foo: some text
    goes here
    and here
    and here
bar: more text
    goes here
    and here
xyz: and more...
    and more...
    and more...

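The text is formatted exactly as shown here. Each "group/section" I'm interested in starts at a line whose text begins in the first column and ends at the line before the next line whose text begins in the first column.

In this example, the groups would be all the text from foo up to, but not including, bar; then from bar up to xyz; and finally from xyz to the end.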

Answer 1

Input

$ cat file
foo: some text
    goes here
    and here
    and here
bar: more text
    goes here
    and here
xyz: and more...
    and more...
    and more...

Output

$ awk '/:/{f=/^foo/}f' file
foo: some text
    goes here
    and here
    and here

If you want to skip the matched header line itself:

$ awk '/:/{f=/^foo/;next}f' file
    goes here
    and here
    and here

Or even

# Just modify variable search value
# 1st approach
$ awk -v search="foo" '/:/{f=$0~"^"search}f' file
foo: some text
    goes here
    and here
    and here

# 2nd approach
$ awk -v search="foo" '/:/{f=$0~"^"search;next}f' file
    goes here
    and here
    and here
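
One caveat: the /:/ pattern fires on every line that contains a colon, so an indented continuation line such as "    note: see below" would also reset the flag. Below is a minimal sketch of one way around that, assuming section headers are exactly the lines that do not start with a space or tab. extract_section is a hypothetical helper name, not part of the answer above, and it compares the label literally instead of as a regex.

```shell
# Hypothetical helper: print one section, header line included.
# $1 is the section label; any remaining arguments are input files.
extract_section() {
  name=$1; shift
  awk -v search="$name" '
    # Only unindented lines start a section. Set the flag when the line
    # begins with exactly "<label>:" (literal comparison, no regex).
    /^[^ \t]/ { f = (index($0, search ":") == 1) }
    # While the flag is set, print the line.
    f
  ' "$@"
}

printf 'foo: some text\n    note: goes here\nbar: more text\n' | extract_section foo
```

With the sample above, the indented "note:" line stays part of the foo section, and a label such as foobar would not be mistaken for foo.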

Answer 2

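If I have interpreted your question correctly, you just want to strip the whitespace and put foo on a separate line from whatever follows the :. This awk script does that: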

awk 'BEGIN{RS="[:\n]"}{$1=$1}1' file

Output:

foo
some text
goes here
and here
and here
bar
more text
goes here
and here
xyz
and more...
and more...
and more...

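Explanation:

RS="[:\n]" means the input is split into records at each : or \n
$1=$1 forces awk to rebuild $0 from its fields (which strips the leading whitespace)
1 is an always-true pattern, so every record gets the default action, which is print $0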
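
To see the $1=$1 step in isolation, here is a small sketch that leaves RS alone and only rebuilds each line. trim_fields is a hypothetical wrapper name used for illustration.

```shell
# Hypothetical wrapper: rebuild every record from its fields.
trim_fields() {
  # Assigning to a field makes awk rebuild $0 from its fields joined by OFS
  # (a single space by default); the bare 1 then prints every record.
  awk '{ $1 = $1 } 1' "$@"
}

printf '    goes   here  \n' | trim_fields
```

Note that the rebuild also collapses runs of whitespace inside the line, not just the leading indentation, which may or may not be what you want.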

Answer 3

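As others have said, you haven't specified what you want to do with the data once you've parsed it.

If you just want to extract one particular block, the answer from Akshay Hegde should work fine.

If you want to use more of awk's features to process each record, for example to transform the output in some way (joining lines together and so on), you may need something a little different.

There are a few fairly simple ways to do this, but I think the best is probably to change the record separator.

Using a regular expression as the record separator is an extension beyond POSIX awk that gawk supports, but if you're on Linux you are probably using gawk. The program below also relies on RT and gensub, which are specific to gawk.

Here are the contents of a gawk program file "prog.awk":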

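# Called once per group with the group name and the text that follows it.
function process_group(name, body) {
    print "Got group with name '" name "'"
    print body
}

BEGIN {
    # A record ends at a "name:" label at the start of the file or of a line.
    RS="(\n|^)\\S+:"
    PREV=""
}

{
    if (PREV!="") {
        # Strip the leading newline and trailing colon from the saved separator.
        process_group(gensub(/\n?(\S+):/, "\\1", 1, PREV), $0)
    }
    # RT holds the text that matched RS for this record (gawk only).
    PREV=RT
}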

You can run it with

gawk -f prog.awk input.txt

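Alternatively, you could put the whole thing on the gawk command line, but it is easier to read when it is formatted properly.

The idea is that each time gawk sees a record separator, it gives you whatever it has read since the previous record separator or the beginning of the file. This means that the first time it sees a record separator, it calls the bottom block with the record separator "foo:" and an empty body; the second time, it calls the block with "bar:" and the content between "foo:" and "bar:"; and so on.

So the record separator that belongs to each body is the previous one, not the current one. That is easy to handle by keeping track of the previous record separator in the "PREV" variable.

The BEGIN block therefore sets the record separator RS and initializes PREV to empty.

The bottom block is called for each record delimited by RS, and once more at the end of the file.

If "PREV" is not empty, it calls the "process_group" function with the current body and the previous record separator (using gensub to strip the uninteresting bits from PREV). It then assigns the currently matched record separator (RT) to PREV, ready for the next call.

In "process_group" you can do whatever processing you need on each group. Here I just print them out, but it should be easy to modify it to do what you want.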
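
If gawk is not available, the same collect-then-process idea can be approximated in POSIX awk without a regex RS, by buffering the lines of each group and calling the handler when the next header (or the end of input) arrives. This is a sketch of a swapped-in technique, not the RS/RT approach above. The wrapper name process_groups and the line-joining body of process_group are illustrative assumptions.

```shell
# Hypothetical wrapper: buffer each group, then hand it to process_group.
process_groups() {
  awk '
    # Illustrative handler: join the body lines of one group into a single line.
    function process_group(name, body) {
        gsub(/\n[ \t]*/, " ", body)
        print name " => " body
    }
    # An unindented "name:" line starts a new group, so flush the previous one.
    /^[^ \t:]+:/ {
        if (name != "") process_group(name, body)
        name = substr($0, 1, index($0, ":") - 1)
        body = substr($0, index($0, ":") + 1)
        sub(/^[ \t]+/, "", body)
        next
    }
    # Any other line belongs to the current group.
    name != "" { body = body "\n" $0 }
    # The last group has no following header, so flush it at end of input.
    END { if (name != "") process_group(name, body) }
  ' "$@"
}

printf 'foo: some text\n    goes here\nbar: more text\n    and here\n' | process_groups
```

As in the gawk version, all per-group logic lives in process_group, so changing what happens to each group means editing only that function.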

Answer 4

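First, if there is only one section, go with @Akshay Hegde. Otherwise, if you can change RS, follow @sheltond. For logfile processing, though, I often need line-by-line extraction in some places and multi-line extraction for certain sections, so that the logfile summary stays as short as possible.

For that I usually use some variation on a brain-dead pattern. For example, suppose I want to

print the first line of all non-bar sections, and
also print every bar section with extra detail (here, with its lines joined)

File print_bar_sections.awk: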

function bar_may_end_here() { # This check might happen in several places
    if(bar_started){
        print(bar_out); bar_out=""; bar_started=0;
    }
}

# Here, any section-begin match might be terminating a bar section
/^[a-z]*:/ {bar_may_end_here();}
    
# Match start of interesting section, this line always included
/^bar:/ {bar_started=1; bar_out=$0; next;}
    
# Perhaps modify or skip interior lines?
#    bar_started==1 && /goes/ {bar_out = bar_out "GOES-LINE"; next;}
# Here, join lines
bar_started==1 {bar_out = bar_out $0; next;}

# Here we know we are not in a bar-section.
# For example, we might have single-line "interesting lines"
/error/ {print; next;}
/warning/ {print; next;}

# EOF might also terminate an active bar section
# (for logfiles you might know this is impossible)
END { bar_may_end_here(); }

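Adapt this pattern as needed. awk variables start out as both the empty string and 0, so no initialization is needed. The next command is especially useful when building section extractors like this for logfile processing.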
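
A quick sketch of that default-value behaviour, which is why bar_started and bar_out need no setup above. show_defaults is a hypothetical name used only for this demonstration.

```shell
# Hypothetical demo: an unassigned awk variable acts as 0 and as "".
show_defaults() {
  awk 'BEGIN {
      # x has never been assigned, so it compares equal to both 0 and "".
      if (x == 0 && x == "") print "unset variable is both 0 and empty"
      # It can be used directly as a counter or as a string accumulator.
      x++; s = s "abc"
      print x, s
  }'
}

show_defaults
```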

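This approach of keeping state-machine variables (such as bar_started) and state information (such as the bar_out string) can sometimes support more complex awk programs. For example, the state variable might need more values than 0 or 1, and the stored state information might be more complex (arrays or several variables). Enjoy!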
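
As a small illustration of richer state, the following sketch keeps the current section name in a variable (so the state has more than two values) and counts lines per section in an array. The helper name count_section_lines and the choice of what to count are assumptions made for the example.

```shell
# Hypothetical helper: report how many lines each section contains.
count_section_lines() {
  awk '
    # state holds the name of the current section ("" before the first header).
    /^[a-z]+:/ {
        state = substr($0, 1, index($0, ":") - 1)
        order[++n] = state          # remember the order sections appeared in
    }
    # Count every line, header included, against the active section.
    state != "" { lines[state]++ }
    END { for (i = 1; i <= n; i++) print order[i], lines[order[i]] }
  ' "$@"
}

printf 'foo: some text\n    goes here\nbar: more text\n' | count_section_lines
```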

Answer 1
2 2017-02-20 17:07:26

Answer 2
0 2017-02-20 20:00:17

Answer 3
0 2017-02-21 09:46:28

Answer 4
0 2021-12-03 14:55:59