
Using xargs, for each line in a file, count the occurrences of those lines in another file

This is a question in a Linux (CentOS 7) academic module that I have found myself stuck on. I have a file of IPs extracted from a log file. I have sorted the IPs into a new file, removing duplicates.

The task at hand is: for each line in the file of unique IPs, search the initial log file for how many times that IP occurs, and output a file where each line is simply a count of an IP's occurrences (it doesn't contain the actual IP, which doesn't make much sense, but that's for the next question).

The new file should also contain only the numbers for the 10 highest occurrence counts.

I am told I should be using xargs. Also note this is in no way for any kind of test / exam.

Many Thanks.

Using xargs for this seems misdirected and inefficient. With Awk you can traverse the log file just once.

awk 'NR == FNR { a[$0] = 0; next }
{ for (i=1; i<=NF; ++i) if ($i in a) a[$i]++ }
END { for(k in a) if (a[k]) print a[k], k }' iplist.txt logfile.log

The Awk idiom NR == FNR { ...; next } lets you read the first file into memory, so that you can then check subsequent files against the structure you have in memory.

We read each IP address into the associative array a as a key; then in subsequent files we iterate over each word on each line and check whether it's one of the keys in a; if so, we increment that key's count in the associative array.
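To produce exactly what the question asks for (a file of only the top 10 counts, without the IPs), the END block can print just the count and the output can be piped through sort and head. A minimal runnable sketch, with invented sample data and illustrative filenames (iplist.txt, logfile.log, counts.txt):

```shell
# Work in a scratch directory and fabricate sample data for the demo.
cd "$(mktemp -d)"
printf '1.2.3.4\n5.6.7.8\n' > iplist.txt
printf '1.2.3.4 GET /a\n1.2.3.4 GET /b\n5.6.7.8 GET /c\n' > logfile.log

# Same NR == FNR idiom as above, but END prints only the count (no IP);
# sort -rn | head -10 keeps the 10 largest counts.
awk 'NR == FNR { a[$0] = 0; next }
     { for (i = 1; i <= NF; ++i) if ($i in a) a[$i]++ }
     END { for (k in a) if (a[k]) print a[k] }' iplist.txt logfile.log |
  sort -rn | head -10 > counts.txt

cat counts.txt
```

With this sample log, 1.2.3.4 occurs twice and 5.6.7.8 once, so counts.txt holds 2 and then 1.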

A non-xargs one-liner:

while read -r ip; do grep -Fwc "$ip" logfile; done < ips.txt | sort -rn | head -10

One possible xargs approach, using it to run several greps in parallel (you can also use GNU parallel for this):

xargs -a ips.txt -I'{}' -n1 -P4 grep -Fwc '{}' logfile | sort -rn | head -10

This reads lines from ips.txt instead of standard input, and launches up to 4 copies of grep at a time, each with a single input line used as the word for that grep to count matches of.
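As a self-contained demonstration of that pipeline (sample data and filenames invented for illustration; -a requires GNU xargs, which CentOS 7 has):

```shell
# Scratch directory with fabricated data: three unique IPs, a five-line log.
cd "$(mktemp -d)"
printf '10.0.0.1\n10.0.0.2\n10.0.0.3\n' > ips.txt
printf '10.0.0.1 a\n10.0.0.2 b\n10.0.0.1 c\n10.0.0.3 d\n10.0.0.1 e\n' > logfile

# -a ips.txt: take arguments from the file; -n1 -P4: one IP per grep,
# up to four greps at once; grep -Fwc prints the number of lines
# containing that IP as a whole word.
xargs -a ips.txt -I'{}' -n1 -P4 grep -Fwc '{}' logfile | sort -rn | head -10
```

Here 10.0.0.1 matches three log lines and the other two IPs one each, so the sorted output is 3, 1, 1. Note that -w matters: with plain -F, 10.0.0.1 would also match inside a longer address such as 10.0.0.10.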

With GNU Parallel it would look like this:

parallel -j0 --tag grep -wFc {} log < ips | sort -k2nr | head

If log is big you will want to avoid reading it again and again, and for that you can use --tee, as long as you can run one process per IP:

cat log | parallel --tee --pipe --tag grep -wFc {} :::: ips | sort -k2nr | head

This solution uses GNU Parallel, so you cannot use it if you are not allowed to use Perl programs. This answer is primarily written as a reference for others who do not have that restriction.
