How do I compare one column of a file with another column of another file using awk?
I have two files as follows:
file1.txt
2018-03-14 13:23:00 CID [72883359]
2018-03-14 13:23:00 CID [275507537]
2018-03-14 13:23:00 CID [275507539]
2018-03-14 13:23:00 CID [207101094]
2018-03-14 13:23:00 CID [141289821]
and file2.txt
2018-03-14 13:23:00 CID [207101072]
2018-03-14 13:23:00 CID [275507524]
2018-03-14 13:23:00 CID [141289788]
2018-03-14 13:23:00 CID [72883352]
2018-03-14 13:23:01 CID [72883359]
2018-03-14 13:23:00 CID [275507532]
I need to compare the 4th column of the first file with the 4th column of the second file. I am using the following command:
awk 'FNR==NR{a[$4]=$1" "$2" "$3; next} ($4 in a) {print a[$4],$4,$1,$2}' file1.txt file2.txt>file3.txt
Its output is as follows.
2018-03-14 13:23:00 CID [72883359] 2018-03-14 13:23:01
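The above command works fine, but the problem is that file1 and file2 are large, with about 20k lines each, so the command takes a long time.
What I want is that once a match is found, it should skip the rest of the column and move on to the next entry, meaning some kind of break statement. Please help.
Below is my script.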
#!/bin/sh
cron=1;
for((j = $cron; j >= 1; j--))
do
d1=`date -d "$date1 $j min ago" +%Y-%m-%d`
d2=`date -d 'tomorrow' '+%Y-%m-%d'`
t1=`date -d "$date1 2 min ago" +%R`
t2=`date -d "$date1 1 min ago" +%R`
t3=`date --date="0min" +%R`
done
cat /prd/firewall/logs/lwsg_event.log | egrep "$d1|$d2" | egrep "$t1|$t2|$t3" | grep 'SRIR' | awk -F ' ' '{print $1,$2,$4,$5}'>file1.txt
cat /prd/firewall/logs/lwsg_event.log | egrep "$d1|$d2" | egrep "$t1|$t2|$t3" | grep 'SRIC' | awk -F ' ' '{print $1,$2,$4,$5}'>file2.txt
awk 'FNR==NR{a[$4]=$1" "$2" "$3; next} ($4 in a) {print a[$4],$4,$1,$2}' file1.txt file2.txt>file3.txt
cat file3.txt | while read LINE
do
f1=`echo $LINE | cut -f 1 -d " "`
f2=`echo $LINE | cut -f 2 -d " "`
String1=$f1" "$f2
f3=`echo $LINE | cut -f 5 -d " "`
f4=`echo $LINE | cut -f 6 -d " "`
String2=$f3" "$f4
f5=`echo $LINE | cut -f 3 -d " "`
f6=`echo $LINE | cut -f 4 -d " "`
String3=$f5" "$f6
StartDate=$(date -u -d "$String1" +"%s")
FinalDate=$(date -u -d "$String2" +"%s")
echo "Diff for $String3 :" `date -u -d "0 $FinalDate sec - $StartDate sec" +"%H:%M:%S"` >final_output.txt
done
final_output.txt will be
Diff for CID [142298410] : 00:00:01
Diff for CID [273089511] : 00:00:00
Diff for CID [273089515] : 00:00:00
Diff for CID [138871787] : 00:00:00
Diff for CID [273089521] : 00:00:00
Diff for CID [208877371] : 00:00:00
Diff for CID [138871793] : 00:00:00
Diff for CID [138871803] : 00:00:00
Diff for CID [273089526] : 00:00:00
Diff for CID [273089545] : 00:00:00
Diff for CID [208877406] : 00:00:02
Diff for CID [208877409] : 00:00:01
Diff for CID [138871826] : 00:00:00
Diff for CID [74659680] : 00:00:00
Could you please try the following awk and let me know if this helps you.
awk 'FNR==NR{a[$4]=$0;next} ($4 in a){print a[$4],$1,$2}' file1.txt file2.txt
Have you considered the join command? It seems that not many people know about join.
NAME
join - join lines of two files on a common field
SYNOPSIS
join [OPTION]... FILE1 FILE2
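As a minimal sketch of how join could be applied here (not a tested drop-in replacement): join expects both inputs to be sorted on the join field, so the files are sorted on column 4 first. The scratch directory and the `.sorted` file names are assumptions made for illustration, and the sample rows are a subset of the data in the question.

```shell
#!/bin/sh
# Hypothetical scratch directory with a few sample rows from the question.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
cat > file1.txt <<'EOF'
2018-03-14 13:23:00 CID [72883359]
2018-03-14 13:23:00 CID [275507537]
EOF
cat > file2.txt <<'EOF'
2018-03-14 13:23:00 CID [72883352]
2018-03-14 13:23:01 CID [72883359]
EOF

# sort and join must agree on collation, so force the C locale for both.
export LC_ALL=C

# join needs both inputs sorted on the join field (column 4).
sort -t ' ' -k4,4 file1.txt > file1.sorted
sort -t ' ' -k4,4 file2.txt > file2.sorted

# -1 4 -2 4 : join on field 4 of each file
# -o ...    : print all four fields of file1, then date and time of file2
join -t ' ' -1 4 -2 4 -o 1.1,1.2,1.3,1.4,2.1,2.2 \
    file1.sorted file2.sorted > file3.txt

cat file3.txt
```

Note that join emits its results in sorted key order and prints every matching pair when a key repeats, whereas the awk command keeps only the last file1 line per key and follows the order of file2.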
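Your entire script reads the same file multiple times and contains a number of other inefficiencies.
Without proper input to test with, it is hard to verify this, but here is a refactoring which should at least offer a good direction for further exploration.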
#!/bin/bash
# for ((...)) is a Bash extension, so run this under bash rather than sh
cron=1;
for((j = $cron; j >= 1; j--))
do
# Replace obsolescent `backticks` with $(modern command substitution) syntax
d1=$(date -d "$date1 $j min ago" +%Y-%m-%d)
d2=$(date -d 'tomorrow' '+%Y-%m-%d')
t1=$(date -d "$date1 2 min ago" +%R)
t2=$(date -d "$date1 1 min ago" +%R)
t3=$(date --date="0min" +%R)
done
# Avoid useless cat and useless grep, fold everything into one Awk script
# See also http://www.iki.fi/era/unix/award.html
awk -v d="$d1|$d2" -v t="$t1|$t2|$t3" '
$0 !~ d {next} $0 !~ t { next }
{ o = "" }
/SRIR/ { o="file1.txt" }
/SRIC/ { o="file2.txt" }
o { print $1,$2,$4,$5 > o; o="" }' /prd/firewall/logs/lwsg_event.log
awk 'FNR==NR{a[$4]=$1" "$2" "$3; next} ($4 in a) {print a[$4],$4,$1,$2}' file1.txt file2.txt>file3.txt
# Avoid uppercase for private variables
# Use read -r always
# Let read split the line
while read -r f1 f2 f5 f6 f3 f4
do
String1=$f1" "$f2
String2=$f3" "$f4
String3=$f5" "$f6
StartDate=$(date -u -d "$String1" +"%s")
FinalDate=$(date -u -d "$String2" +"%s")
echo "Diff for $String3 :" $(date -u -d "0 $FinalDate sec - $StartDate sec" +"%H:%M:%S")
done <file3.txt >final_output.txt
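My guess is that the main bottleneck is that you process the log file multiple times, not the small Awk snippet you run on the results, which is the part you asked for help with.
This could still be refactored into a single Awk script. If you have GNU Awk, you should be able to do the date calculations in Awk as well.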
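As a rough sketch of that last point, the matching step and the time difference could be folded into one GNU Awk invocation using its `mktime()` function, which replaces the `while read` loop and the per-line `date` calls. This assumes `gawk` is installed and that the second timestamp is never earlier than the first; the helper name `epoch` and the scratch directory are hypothetical, and the sample rows are taken from the question.

```shell
#!/bin/sh
# Hypothetical scratch directory with a few sample rows from the question.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
cat > file1.txt <<'EOF'
2018-03-14 13:23:00 CID [72883359]
2018-03-14 13:23:00 CID [275507537]
EOF
cat > file2.txt <<'EOF'
2018-03-14 13:23:00 CID [72883352]
2018-03-14 13:23:01 CID [72883359]
EOF

if command -v gawk >/dev/null 2>&1; then
    # TZ=UTC keeps mktime() free of daylight saving jumps.
    TZ=UTC gawk '
    # epoch() is a hypothetical helper: it turns "YYYY-MM-DD" and "HH:MM:SS"
    # into the "YYYY MM DD HH MM SS" form that mktime() expects.
    function epoch(d, t,    s) {
        s = d " " t
        gsub(/[-:]/, " ", s)
        return mktime(s)
    }
    # First file: remember the start time for each ID in column 4.
    FNR == NR { start[$4] = epoch($1, $2); next }
    # Second file: on a matching ID, print the elapsed time as HH:MM:SS.
    $4 in start {
        diff = epoch($1, $2) - start[$4]
        printf "Diff for %s %s : %02d:%02d:%02d\n",
            $3, $4, diff / 3600, (diff % 3600) / 60, diff % 60
    }' file1.txt file2.txt > final_output.txt
    cat final_output.txt
else
    echo "gawk not found; this sketch needs GNU Awk for mktime()" >&2
fi
```

Because everything happens inside a single process, no external `date` command is spawned per line, which is where the original loop spends much of its time.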