Command line for merging lines with a matching first field, 50 GB input

Question

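A while back, I asked a question about merging lines that share a common first field. Here is the original: Command line to match lines with matching first field (sed, awk, etc.)

Sample input: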

a|lorem
b|ipsum
b|dolor
c|sit
d|amet
d|consectetur
e|adipisicing
e|elit

Desired output:

b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit

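The idea is that lines are merged if their first field matches. The input is sorted. The actual content is more complex, but uses the pipe as the sole delimiter.

The methods provided in the prior question worked well on my 0.5 GB file, with a processing time of about 16 seconds. However, my new file is roughly 100 times larger, and I would prefer a method that streams. In theory, this should be able to run in about 30 minutes. The prior method had not completed after running for 24 hours.

Running on MacOS (i.e., a BSD-type unix).

Ideas? [Note: the prior answer to the prior question was not a one-liner.]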

Answer 1

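You can append the results to a file on the fly, so that you don't need to build a 50 GB array (which I assume you don't have the memory for!). This command concatenates the fields for each distinct index into a string, which is written to a file named after the respective index with some suffix.

EDIT: following the OP's comment that the content may contain spaces, I suggest using -F"|" instead of sub. The answer below is also intended to write to standard output.

(New) code: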

# split the file on the pipe using -F
# if index "i" is still $1 (and i exists) concatenate the string
# if index "i" is not $1 or doesn't exist yet, print current a
# (will be a single blank line for first line)
# afterwards, this will print the concatenated data for the last index
# reset a for the new index and take the first data set
# set i to $1 each time
# END statement to print the single last string "a"
awk -F"|" '$1==i{a=a"|"$2}$1!=i{print a; a=$2}{i=$1}END{print a}'

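This builds up a string of "data" for a given index, prints it out when the index changes, and then starts building the next string for the new index until that one ends, and so on.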
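As written, the awk command above prints only the joined data fields (without the key) and also prints groups that consist of a single line, so its output differs from the desired output in the question. A minimal sketch of a variant that keeps the key and prints only keys seen on two or more consecutive lines might look like the following. The function name merge_groups is hypothetical, and the sketch assumes that lines with equal keys are adjacent, as in the sorted input.

```shell
# merge_groups: hypothetical helper name, not part of the original answer.
# Reads pipe-delimited lines on stdin (equal keys must be adjacent) and
# prints only keys that occur on two or more consecutive lines,
# in the form key|data1|data2|...
merge_groups() {
  awk -F'|' '
    # Same key as the previous line: append everything after the key.
    # The ("" concatenation) forces a string comparison of the keys.
    NR > 1 && ($1 "") == key {
      buf = buf "|" substr($0, length($1) + 2)
      n++
      next
    }
    # Key changed: flush the previous group if it had more than one line.
    n > 1 { print buf }
    # Start a new group with the current line.
    { key = $1 ""; buf = $0; n = 1 }
    # Flush the final group.
    END { if (n > 1) print buf }
  '
}
```

It could be used as `merge_groups < input.txt > merged.txt`, where both file names are placeholders. Only one group is held in memory at a time, so memory use should not grow with the size of the input.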

Answer 2

sed -n '# label anchor for a jump
   :loop
# if end of file, jump to the final print
   $ b last
# load a new line in working buffer (so always 2 lines loaded after)
   N
# verify if the 2 lines have the same key (text before the first pipe) and join if so
   s/^\(\([^|]*\)|.*\)\n\2|/\1|/
# if lines are joined, cycle and redo with next line (jump to :loop)
   t loop
# (No joined lines here)
# if more than 2 elements on first line, print first line
   /.*|.*|.*\n/ P
# remove first line
   s/^.*\n//
# (always modified here) cycle (jump to :loop)
   t loop
   :last
# print working buffer if it holds a merged line, then exit
   /.*|.*|/ p
   ' YourFile

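POSIX version (may need --posix on a Mac).
Self-commented.
Assumes sorted entries, no empty lines, and no pipe in the data (nor an escaped one).
If available, use the unbuffered option -u for stream processing.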
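The assumptions listed above about empty lines and pipes in the data can be checked in a single streaming pass before running the sed script. The following is only a sketch: the function name check_lines and the report format are hypothetical, and it does not verify that the input is sorted.

```shell
# check_lines: hypothetical helper name, not part of the original answer.
# Counts empty lines and non-empty lines that do not have exactly
# two pipe-delimited fields. Exits non-zero if any were found.
check_lines() {
  awk -F'|' '
    $0 == "" { blank++ }
    $0 != "" && NF != 2 { bad++ }
    END {
      printf "blank=%d bad_fields=%d\n", blank, bad
      exit (blank + bad > 0)
    }
  '
}
```

For example, `check_lines < YourFile` prints the two counts and returns a non-zero status if either assumption is violated.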
