How to format bash sed/awk/Perl output for further processing

Question

I have some text file data that I am parsing with sed, awk, and Perl.

product {
    name { thing1 }
    customers {        
        mary { }
        freddy { }
        bob {
            spouse betty
        }
    }
}

From the customers section, I am trying to get output like the following:

mary{ }
freddy{ }
bob{spouse betty}

Using: sed -n -e "/customers {/,/}/{/customers {/d;/}/d;p;}" $file

This is the output:

mary { }
freddy { }
bob {
    spouse betty
}

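How can I join the "bob" customer onto a single line and remove the extra whitespace? The main reason for wanting this particular output is that I am writing a script that grabs the customers field and other fields from the text file and writes them to a CSV file, which would look like the following. I know this would probably be easier in another language, but bash is what I know.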

output.csv
product,customers,another_column
thing1,mary{ } freddy{ } bob{spouse betty},something_else

Answer 1

The data happens to have valid Tcl list syntax:

set f [open "input.file"]
set data [dict create {*}[read $f]]
close $f

set name [string trim [dict get $data product name]]
dict for {key val} [dict get $data product customers] {
    lappend customers [format "%s{%s}" $key [string trim $val]]
}

set f [open "output.csv" w]
puts $f "product,customers,another_column"
puts $f [join [list $name [join $customers] "something_else"] ,]
close $f

This creates output.csv with

product,customers,another_column
thing1,mary{} freddy{} bob{spouse betty},something_else

Answer 2

Edit: see the end for producing the complete desired output.

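Here is a regex for it, usable in almost any language, run over the whole file read as a single string. As it stands, it assumes there can be only one level of nesting under customers; in other words, bob cannot contain { pets { dog } } or anything like it.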

Extract the contents of the customers section

/customers\s*{\s* ( (?: [^{]+ {[^}]*} )+ )/x;

Then collapse each newline plus the whitespace that follows it into a single space

s/\n\s+/ /g;

Then trim the spaces inside strings like bob { spouse betty }, but not inside mary { }

s/{\s+ ([^}]+) \s+}/{$1}/gx;

If bob and crew can really only consist of word characters, we can use the more precise \w instead of [^{}].

All together, as a Perl command-line program, which seems to be what is wanted

perl -wE'die "file?\n" if not @ARGV; 
    $d = do { local $/; <> };
    ($c) = $d =~ /customers\s*{\s* ( (?: [^{]+ {[^}]*} )+ )/x; 
    $c =~ s/\n\s+/ /g;          
    $c =~ s/{\s+ ([^}]+) \s+}/{$1}/gx; 
    say $c
' data.txt

For the data given in the question, this prints

mary { } freddy { } bob {spouse betty}

To print each customer on a separate line, for example, one can do

say for split /(?<=\})\s+/, $c;

(as the last line of the code)

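I now realize there is more to capture and print, as described in the last paragraph of the question. Add to the beginning of the regex to capture name, and add the required prints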

perl -wE'die "file?\n" if not @ARGV; 
    $d = do { local $/; <> };
    ($n, $c) = $d =~ /name\s*{\s* ([^}]+) \s*} .*?  customers\s*{\s* ( (?: [^{]+ {[^}]*} )+ )/sx; 
    $n =~ s/^\s+|\s+$//g;
    $c =~ s/\n\s+/ /g;
    $c =~ s/{\s+ ([^}]+) \s+}/{$1}/gx; 
    say "product,customers,another_column";
    say "$n,$c,something_else"
' data.txt > output.csv

The output is redirected to output.csv, as shown in the question.
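For readers who would rather stay in the shell, the same three steps (extract, collapse, trim) can be approximated with tr and sed. This is only a sketch and not part of the original answer: the function name customers_line is made up for illustration, and it carries the same assumption of a single level of nesting under customers. Here the whitespace is collapsed first so that each sed expression works on one line.

```sh
#!/bin/sh
# Hypothetical helper: print the customers entries of file $1 on one line.
# Assumes only one level of nesting under "customers", as in the sample data.
customers_line() {
  # Collapse every run of whitespace (newlines included) into one space;
  # the echo restores a trailing newline so sed sees a complete line.
  { tr -s '[:space:]' ' ' < "$1"; echo; } |
    # Keep only the "name { ... }" entries that follow "customers { ".
    sed -n 's/.*customers { \(\([^{}]* {[^{}]*}\)*\).*/\1/p' |
    # Trim the spaces just inside non-empty braces; "{ }" is left alone.
    sed 's/{ \([^{} ][^{}]*\) }/{\1}/g'
}
```

Called as customers_line data.txt on the sample from the question, this should print mary { } freddy { } bob {spouse betty}, the same line the Perl program produces.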

Answer 3

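With your shown samples only, you can try the following code in GNU awk. Everything can be done in a single GNU awk program; there is no need to pass the output of your sed command to any other tool. Just pass your Input_file to this awk program.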

First solution: to get the output from the customers section up to its closing }, with no leading spaces before the values, try the following GNU awk solution.

awk -v RS='\n[[:space:]]+customers {[[:space:]]*.*\n[[:space:]]+}' '
RT{
  sub(/^\n[[:space:]]+[^ ]* {[[:space:]]*\n/,"",RT)
  sub(/\n[[:space:]]+}/,"",RT)
  match(RT,/(.*{)[[:space:]]*([^\n]*)(.*)/,arr)
  sub(/^[[:space:]]+/,"",arr[1])
  sub(/\n/,"",arr[2])
  gsub(/\n|^[[:space:]]+/,"",arr[3])
  gsub(/\n[[:space:]]+/,"\n",arr[1])
  gsub(/ {/,"{",arr[1])
  print arr[1] arr[2] arr[3]
}
'   Input_file

The output will be as follows:

mary{ }
freddy{ }
bob{spouse betty}

Second solution: to keep the leading spaces before the values, try the following code.

awk -v RS='\n[[:space:]]+customers {[[:space:]]*.*\n[[:space:]]+}' '
RT{
  sub(/^\n[[:space:]]+[^ ]* {[[:space:]]*\n/,"",RT)
  sub(/\n[[:space:]]+}/,"",RT)
  match(RT,/(.*{)[[:space:]]*([^\n]*)(.*)/,arr)
  sub(/\n/,"",arr[2])
  gsub(/\n|^[[:space:]]+/,"",arr[3])
  print arr[1] arr[2] arr[3]
}
'   Input_file

The output will be as follows:

        mary { }
        freddy { }
        bob {spouse betty}

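Explanation: in GNU awk, RS (the record separator) is set to the regex \n[[:space:]]+customers {[[:space:]]*.*\n[[:space:]]+} so that it matches only the required block; GNU awk makes the text matched by RS available in RT. In the main block of the awk program, the sub (substitute) calls remove the unwanted parts of that matched text. The match function is then used with the regex (.*{)[[:space:]]*([^\n]*)(.*), whose 3 capture groups are stored in an array named arr. The remaining newlines and spaces are removed from those pieces, and finally the three pieces are printed.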
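RT and the three-argument form of match are GNU awk extensions. Where only a POSIX awk is available, counting brace depth is one possible alternative. The sketch below is not the answerer's method: the function name customers_one_per_line is hypothetical, and it assumes that the customers { line and its closing } each sit on a line of their own, as in the sample.

```sh
#!/bin/sh
# Hypothetical helper: print each customers entry on one line, e.g. bob{spouse betty}.
# Assumes "customers {" and its closing "}" each sit on a line of their own.
customers_one_per_line() {
  awk '
    # Count how many times the regex re matches in s.
    function count(s, re,    t) { t = s; return gsub(re, "", t) }

    {
      line = $0
      gsub(/^[ \t]+|[ \t]+$/, "", line)          # trim the current line
    }

    !inside && line ~ /^customers[ \t]*[{]$/ {   # start of the block
      inside = 1; depth = 0; buf = ""
      next
    }

    inside {
      if (depth == 0 && line == "}") { inside = 0; next }   # end of the block
      if (line == "") next
      buf = (buf == "" ? line : buf " " line)    # join continuation lines
      depth += count(line, "[{]") - count(line, "[}]")
      if (depth == 0) {                          # the entry is complete
        sub(/[ \t]*[{]/, "{", buf)               # "bob {" becomes "bob{"
        if (buf ~ /[{][ \t]+[^} \t]/) {          # non-empty body: trim inside braces
          sub(/[{][ \t]+/, "{", buf)
          sub(/[ \t]+[}]$/, "}", buf)
        }
        print buf
        buf = ""
      }
    }
  ' "$@"
}
```

Called as customers_one_per_line Input_file on the sample data, this should print the same three lines as the first solution. Because an entry is emitted only when the depth returns to zero, entries with deeper nesting are still joined onto one line.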

Answer 4

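The following code sample demonstrates the most primitive parser for the provided sample data.

The code recovers the data structure, which can then be used in any way imaginable, for example stored as a CSV, JSON, or YAML file.

In real life the input data may be quite different, and this code may not handle it correctly.

The code is provided for educational purposes only.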

use strict;
use warnings;
use feature 'say';

use Data::Dumper;

my $data = do { local $/; <DATA> };

$data =~ s/\n/ /g;
$data =~ s/ +/ /g;

say Dumper parse($data);

exit 0;

sub parse {
    my $str  = shift;   
    my $ret;

    while( $str =~ /^(\S+) \{ (\S+) \{ \S+/ ) {
        if( $str =~ /^(\S+) \{ (\S+) \{ ([^}]+?) \{(.+?)\}/ ) {
            $ret->{$1}{$2}{$3} = $4;
            $ret->{$1}{$2}{$3} =~ s/(^\s+|\s+$)//g;
            $str =~ s/^(\S+) \{ (\S+) \{(.+?)\{(.*?)\}/$1 \{ $2 \{/;
        }
        if( $str =~ /^(\S+) \{ (\S+) \{\s*([^{]+?)\s*\}/ ) {
            $ret->{$1}{$2} = $3 if length($3) > 1;
            $str =~ s/^(\S+) \{ \S+ \{\s*[^\}]+\s*\}/$1 \{/;
        }
    }
    
    return $ret;
}

__DATA__
product {
    name { thing1 }
    customers {        
        mary { }
        freddy { }
        bob {
            spouse betty
        }
    }
}

Output

$VAR1 = {
          'product' => {
                         'customers' => {
                                          'bob' => 'spouse betty',
                                          'freddy' => '',
                                          'mary' => ''
                                        },
                         'name' => 'thing1'
                       }
        };
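
Once the name and customers values are available as plain strings, the CSV rows from the question can be written from the shell with printf. The sketch below is an illustration only: csv_field and csv_row are hypothetical helper names, and the quoting follows the common CSV convention of wrapping a field in double quotes when it contains a comma or a quote and doubling any embedded quotes. It assumes fields contain no newlines.

```sh
#!/bin/sh
# Hypothetical helpers for writing CSV rows; fields are assumed to be single-line.

# Print one field, quoted only if it contains a comma or a double quote.
csv_field() {
  case $1 in
    *,* | *\"*)
      # Double the embedded quotes, then wrap the whole field in quotes.
      printf '"%s"' "$(printf '%s\n' "$1" | sed 's/"/""/g')"
      ;;
    *)
      printf '%s' "$1"
      ;;
  esac
}

# Print all arguments as one comma-separated row.
csv_row() {
  sep=
  for field in "$@"; do
    printf '%s' "$sep"
    csv_field "$field"
    sep=,
  done
  printf '\n'
}
```

For example, csv_row product customers another_column followed by csv_row thing1 'mary{ } freddy{ } bob{spouse betty}' something_else, both redirected to output.csv, should give the two lines shown in the question.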

Answer 5

Perhaps ed

ed -s file.txt <<-'EOF'
  %s/^[[:space:]]*//
  ?{?;/^}/j
  %s/^\([^\{]*\) \(.*\)$/\1\2 /
  /^customers/+1;/^}/-1j
  s/^/thing1,/
  s/ *$/,something_else/
  p
  Q
EOF

With a temporary file, writing to a new file is a bit easier.

ed -s file.txt <<-'EOF'
  %s/^[[:space:]]*//
  /customers {/+1;/^[[:space:]]*}/w out.txt
  %d
  r out.txt
  ?{?;/^}/j
  %s/^\([^\{]*\) \(.*\)$/\1\2 /
  %j
  s/^/thing1,/
  s/ *$/,something_else/
  0a
product,customers,another_column
.
  w output.csv
  ,p
  Q
EOF

The latter creates two files, out.txt and output.csv.
Remove ,p if the output to stdout is not needed.

Answer 6

The input file is called "stack" here.

#!/bin/sh -x

cat > ed1 <<EOF
/customers/
+1
ka
$
-2
kb
'a,'bW output.txt
q
EOF

ed -s stack < ed1


Answer 1: score 4, accepted, 2022-10-05 20:50:56
Answer 2: score 4, 2022-10-05 21:43:13
Answer 3: score 1, 2022-10-05 18:57:43
Answer 4: score 1, 2022-10-05 22:54:18
Answer 5: score 0, 2022-10-05 19:57:50
Answer 6: score 0, 2022-10-08 02:46:07
