How can I process the lines of a file in parallel in a while loop?

Question

input.txt: my actual input file has 5000000 lines

A B C D4.2 E 2022-05-31
A B C D4.2 E 2022-05-31
A B F D4.2 E 2022-05-07
A B C D4.2 E 2022-05-31
X B D E2.0 F 2022-05-30
X B Y D4.2 E 2022-05-06

data.txt: this is another file that I need to reference inside the while loop.

A B C D4.2 E 2022-06-31
X B D E2.0 F 2022-07-30

This is what I need to do:

cat input.txt |while read foo bar tan ban can man
do
KEYVALUE=$(echo $4 |awk -F. '{print $1}')
END_DATE=`egrep -w '$1|${KEYVALUE}|$6' data.txt |awk '{print $5,$6}'`
echo  $foo,$bar,$tan,$ban,$can,$man,${END_DATE}
done

Expected output:

A B C D4.2 E 2022-05-31 2022-06-31
A B C D4.2 E 2022-05-31 2022-06-31
A B F D4.2 E 2022-05-07 2022-06-31
A B C D4.2 E 2022-05-31 2022-06-31
X B D E2.0 F 2022-05-30 2022-07-30
X B Y D4.2 E 2022-05-06 2022-06-31

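My main problem is that the while loop takes more than an hour to get through 500000 input lines. How can I process this in parallel, given that each line is independent of the others and the order of lines in the output file does not matter? I tried using GNU parallel based on some discussions, but none of them helped, or I am not sure how to implement it. I am using RHEL with BASH or KSH.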

Answer 1

Here is one potential solution:

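cat script.awk
#!/usr/bin/awk -f

# First file (data.txt): index the end date by the part of field 4
# before the dot, together with field 5.
NR==FNR{
  split($4, k, ".")
  a[k[1],$5]=$6; next
}
# Remaining input: append the end date when the same key is present.
{
  split($4, k, ".")
  if ((k[1],$5) in a) print $0, a[k[1],$5]
}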

cat input.txt | parallel --pipe -q ./script.awk data.txt -
A B C D4.2 E 2022-05-31 2022-06-31
A B C D4.2 E 2022-05-31 2022-06-31
A B F D4.2 E 2022-05-07 2022-06-31
A B C D4.2 E 2022-05-31 2022-06-31
X B D E2.0 F 2022-05-30 2022-07-30
X B Y D4.2 E 2022-05-06 2022-06-31

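It should be relatively fast. You can tune the parallel command to your parameters (i.e. the size of each file, the number of available cores, etc.) to improve performance, for example by using --pipepart instead of --pipe.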
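As a rough sketch of the --pipepart variant mentioned above: with --pipepart, GNU parallel reads blocks directly from a file named with -a instead of from a pipe. The inline awk program here is a stand-in for script.awk, the two-line sample files and the workdir variable are assumptions made for illustration, and the fallback branch exists only so the sketch still runs where GNU parallel is not installed.

```shell
#!/bin/bash
# Sketch only: tiny sample files in a temporary directory (an assumption,
# not the real 5M-line input).
workdir=$(mktemp -d)
printf '%s\n' 'A B C D4.2 E 2022-05-31' 'X B D E2.0 F 2022-05-30' > "$workdir/input.txt"
printf '%s\n' 'A B C D4.2 E 2022-06-31' 'X B D E2.0 F 2022-07-30' > "$workdir/data.txt"

# Stand-in for script.awk: key on field 4 up to the dot, plus field 5.
join_prog='NR==FNR { split($4, k, "."); a[k[1], $5] = $6; next }
{ split($4, k, "."); if ((k[1], $5) in a) print $0, a[k[1], $5] }'

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # --pipepart takes its input from the file given with -a, not from stdin.
  parallel --pipepart -a "$workdir/input.txt" -q awk "$join_prog" "$workdir/data.txt" - \
    > "$workdir/output.txt"
else
  # No GNU parallel available: run the same join in a single awk process.
  awk "$join_prog" "$workdir/data.txt" "$workdir/input.txt" > "$workdir/output.txt"
fi

sort "$workdir/output.txt"
```

The --block option controls the chunk size handed to each job, so it may be worth experimenting with on a large file. Whether --pipepart actually beats --pipe depends on the disk and the GNU parallel version installed.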

Edit

A rough benchmark suggests it is noticeably faster:

# Copy input.txt many times
for f in {1..100}; do cat input.txt >> input.txt_2; done
for f in {1..1000}; do cat input.txt_2 >> input.txt_3; done
for f in {1..10}; do cat input.txt_3 >> input.txt_4; done

du -h input.txt_4
137M    input.txt_4

wc -l input.txt_4
6000000 input.txt_4

time cat input.txt_4 | parallel --pipe -q ./script.awk data.txt - > output.txt
real    0m7.533s
user    0m22.085s
sys     0m4.494s

Processing a 6M-line input file takes less than 10 seconds. Does that solve your problem?

Answer 2

5068056 lines take 8 seconds without parallel

$ wc -l input.txt 
5068056 input.txt
$ time awk 'NR==FNR{a[$4]=$6} NR!=FNR{print $0, a[$4]}' data.txt input.txt  > output.txt

real    0m8.274s
user    0m5.397s
sys     0m2.869s

$ wc -l output.txt
5068056 output.txt
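The one-liner above relies on the NR==FNR idiom: NR counts records across all files while FNR restarts at 1 for each file, so the two are equal only while awk is reading the first file. A minimal sketch on a few of the sample lines from the question, assuming (as this answer does) that field 4 alone is the lookup key; the workdir, result and end_date names are made up for the example:

```shell
#!/bin/bash
# Sketch only: small sample files in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/data.txt" <<'EOF'
A B C D4.2 E 2022-06-31
X B D E2.0 F 2022-07-30
EOF
cat > "$workdir/input.txt" <<'EOF'
A B C D4.2 E 2022-05-31
X B Y D4.2 E 2022-05-06
X B D E2.0 F 2022-05-30
EOF

result=$(awk '
  NR == FNR { end_date[$4] = $6; next }   # first file: build the lookup table
  { print $0, end_date[$4] }              # second file: append the looked-up date
' "$workdir/data.txt" "$workdir/input.txt")

echo "$result"
```

Because the lookup table is built once and held in memory, each input line costs one array lookup instead of several new processes, which is likely where most of the speedup over the original loop comes from.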

With parallel

time cat input.txt | parallel --pipe -q awk 'NR==FNR{a[$4]=$6; next} {print $0, a[$4]}' data.txt - > output.txt 

real    0m3.319s
user    0m9.284s
sys     0m5.990s

Using split

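inputfile=input.txt
outputfile=output.txt
data=data.txt
count=10

# Split into $count chunks without breaking lines (GNU split).
split -n l/$count "$inputfile" /tmp/input$$
# Join each chunk against the data file in its own background job.
for file in /tmp/input$$*; do
    awk 'NR==FNR{a[$4]=$6; next} {print $0, a[$4]}' "$data" "$file" > "${file}.out" &
done
wait
cat /tmp/input$$*.out > "$outputfile"
rm /tmp/input$$*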

$ time ./split.sh

real    0m1.781s
user    0m7.244s
sys     0m1.536s

Answer 3

If you write a function that does whatever you need for each iteration, you can use nohup.

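In the example below, I simulate the iteration by reading an input.txt file. I created a do_something.sh, which is called either sequentially or in parallel depending on mode. I use date and a log file to print the processing date. I also simulate a 2-second processing delay in each iteration.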

script.sh

#!/bin/bash
mode=$1
log_file=log.txt
echo "" > $log_file

while read -r folder; do
  if [ "$mode" == "parallel" ] ;then
    # Detach each call so the loop does not wait for it.
    nohup "$(pwd)/do_something.sh" "$folder" >/dev/null 2>&1 &
  else
    "$(pwd)/do_something.sh" "$folder"
  fi
done <input.txt

do_something.sh

log_file=log.txt
sleep 2
echo "$(date) : $1" >> $log_file

lines.txt (the file read as input.txt above)

aaaa
bbbb
cccc
dddd

Also, if you want to avoid using another script, you can use this to keep everything in a single script:

https://stackoverflow.com/a/23877183/3957754
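One way to keep everything in a single script is to put the per-line work in a shell function and run it in the background. This is a minimal sketch and not necessarily what the linked answer does; do_something is a hypothetical stand-in for the real work, and the log goes to a temporary file rather than log.txt.

```shell
#!/bin/bash
# Sketch only: do_something is a hypothetical placeholder for the real work.
log_file=$(mktemp)

do_something() {
  sleep 1                               # simulated processing delay
  echo "$(date) : $1" >> "$log_file"    # append one log line per item
}

while read -r item; do
  do_something "$item" &                # run the function as a background job
done <<'EOF'
aaaa
bbbb
cccc
dddd
EOF

wait                                    # block until every background job is done
cat "$log_file"
```

All four items finish after about one second instead of four. Starting one background job per line only suits a modest number of items; for millions of lines, a single awk pass as in the other answers is likely the better fit.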
