
Comparison of cat pipe awk operation to awk command on a file

While trying to optimize some of our server-related data processing, my team and I had a discussion about the usage of Linux commands. We would appreciate help understanding the concept more precisely.

On our servers, log files are created every minute, and we need to search them for specific tags, for example error logs, timeout logs, and request-failure logs. One of many requirements is to report the count of each of these tags.

The simple logic would be to use awk to extract the specific field (with a delimiter), then pipe it through sort and uniq -c to count the number of such instances.

I can see two ways to perform it:

cat fname | awk -F":" '{print $1}' | sort | uniq -c

and

awk -F":" '{print $1}' fname | sort | uniq -c

The files can be several GB in size, so which command would be more efficient?

There are 3 ways to open a file and have awk operate on its contents:

  1. cat opens the file:

     cat file | awk '...' 
  2. shell redirection opens the file:

     awk '...' < file 
  3. awk opens the file

     awk '...' file 

Of those choices:

  1. should always be avoided, as the cat process and the pipe consume resources while providing no value. Search for UUOC (Useless Use Of Cat) for details.

Which of the other 2 to use is debatable:

  2. has the advantage that the shell, rather than the tool, opens the file, so you can rely on consistent error handling if you do this for all tools.
  3. has the advantage that the tool knows the name of the file it is operating on (e.g. FILENAME in awk), so you can use that name internally.

To see the difference, consider these 2 files:

$ ls -l file1 file2
-rw-r--r-- 1 Ed None 4 Mar 30 09:55 file1
--w------- 1 Ed None 0 Mar 30 09:55 file2
$ cat file1
a
b
$ cat file2
cat: file2: Permission denied

and see what happens when you try to run awk on the contents of both using both methods of opening them:

$ awk '{print FILENAME, $0}' < file1
- a
- b

$ awk '{print FILENAME, $0}' file1
file1 a
file1 b

$ awk '{print FILENAME, $0}' < file2
-bash: file2: Permission denied

$ awk '{print FILENAME, $0}' file2
awk: fatal: cannot open file `file2' for reading (Permission denied)

Note that with redirection, the error message for the unreadable file, file2, came from the shell, so it looked essentially like the message I got when I first tried to cat it. When awk opened the file, the error message came from awk itself; it differs from the shell's message and would vary across awk implementations.

Note that when using awk to open the file, FILENAME was populated with the name of the file being operated on, but when using redirection to open the file it was set to "-".
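As a rough illustration of why a populated FILENAME is useful, here is a minimal sketch that counts tags per file in a single awk invocation. The file names app1.log and app2.log and their contents are made up for this example, and the output is piped through sort only because the order of awk's for-in loop is unspecified.

```shell
# Hypothetical sample logs, created only for this illustration.
tmpdir=$(mktemp -d)
printf 'Error:disk full\nTimeout:db\nError:no route\n' > "$tmpdir/app1.log"
printf 'Timeout:cache\nTimeout:db\n' > "$tmpdir/app2.log"

# awk opens the files itself, so FILENAME names the file each record came from.
# Running inside the directory keeps FILENAME a bare name instead of a full path.
per_file_counts=$(
  cd "$tmpdir" &&
  awk -F':' '{cnt[FILENAME " " $1]++} END{for (k in cnt) print cnt[k], k}' app1.log app2.log |
  LC_ALL=C sort
)
printf '%s\n' "$per_file_counts"

rm -rf "$tmpdir"
```

With redirection (awk '...' < file) the same script could not tell the files apart, since awk would never see their names.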

I personally think that the benefit of "3" (populated FILENAME) vastly outweighs the benefit of "2" (consistent error handling of file open errors) and so I would always use:

awk '...' file

and for your particular problem you'd use:

awk -F':' '{cnt[$1]++} END{for (i in cnt) print cnt[i], i}' fname
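To check that the single awk command counts the same things as the sort | uniq -c pipeline, something like the following sketch could be run on a small sample. The sample lines are invented for the illustration, and the sed step is only there to strip the leading padding that uniq -c puts before each count.

```shell
# Hypothetical sample log, created only for this illustration.
tmpfile=$(mktemp)
printf 'Error:disk full\nTimeout:db\nError:no route\nRequest fail:503\n' > "$tmpfile"

# Pipeline version: extract the tag, sort, count, then strip uniq's leading blanks.
pipeline_counts=$(awk -F':' '{print $1}' "$tmpfile" | LC_ALL=C sort | uniq -c |
                  sed 's/^ *//' | LC_ALL=C sort)

# Single-awk version: count in an array; sort only to get a stable order for comparing.
awk_counts=$(awk -F':' '{cnt[$1]++} END{for (i in cnt) print cnt[i], i}' "$tmpfile" |
             LC_ALL=C sort)

printf '%s\n' "$awk_counts"
rm -f "$tmpfile"
```

The awk version only needs to hold one counter per distinct tag in memory, whereas sort has to process every line of the file.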

The useless cat should definitely be avoided by using:

awk -F":" '{print $1}' fname | sort | uniq -c

But my recommendation is to avoid even the expensive sort and uniq commands by finding the unique items in awk itself:

awk -F":" '!seen[$1]++' fname

This prints the first line seen for each distinct value of the first field.

To count how many distinct values the first field takes:

awk -F":" '!count[$1]++{c++} END{print c}' fname
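To make the difference between these two commands concrete, here is a small sketch on invented sample data (the three log lines are assumptions for the illustration, not real logs):

```shell
# Hypothetical sample log, created only for this illustration.
tmpfile=$(mktemp)
printf 'Error:disk full\nTimeout:db\nError:no route\n' > "$tmpfile"

# Keeps only the first line seen for each distinct first field.
first_lines=$(awk -F':' '!seen[$1]++' "$tmpfile")

# Counts how many distinct first fields there are (not how often each occurs).
distinct_tags=$(awk -F':' '!count[$1]++{c++} END{print c}' "$tmpfile")

printf '%s\n' "$first_lines"
printf '%s\n' "$distinct_tags"
rm -f "$tmpfile"
```

Here the second "Error" line is dropped by the first command, and the second command reports 2 because only the tags Error and Timeout occur. Neither command gives the per-tag occurrence counts that uniq -c produces.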

cat fname | slows things down a little, as the data has to be copied from the disk to the kernel, then to cat's buffer, then to a pipe (which goes through the kernel again), and then to another process. It's not by a lot, as it should only be a linear slow-down and in-memory copying is quite fast, but you can always (that is, without depending on some_command accepting file arguments) speed things up by replacing

cat one_file_name | some_command 

with

<one_file_name some_command

which will be faster, as it directly sets one_file_name as the stdin of some_command.

<one_file_name can be, and often is, placed after some_command and before the next pipe symbol. I personally like to start with it, as it mirrors the left-to-right flow of the useless but fairly common use of cat (cat one_file_name).
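A minimal sketch showing that the cat form and both placements of the redirection feed the same data to the command. The helper function count_tags and the sample lines are made up for this illustration; the function stands in for some_command and reads only from stdin.

```shell
# Hypothetical sample log, created only for this illustration.
tmpfile=$(mktemp)
printf 'Error:a\nTimeout:b\nError:c\n' > "$tmpfile"

# Stand-in for some_command: counts first fields read from stdin.
count_tags() {
  awk -F':' '{cnt[$1]++} END{for (i in cnt) print cnt[i], i}' | LC_ALL=C sort
}

via_cat=$(cat "$tmpfile" | count_tags)          # extra process and pipe
via_leading_redirect=$(<"$tmpfile" count_tags)  # redirection written first
via_trailing_redirect=$(count_tags <"$tmpfile") # redirection written last

printf '%s\n' "$via_leading_redirect"
rm -f "$tmpfile"
```

All three variables end up with the same text; only the way the file reaches stdin differs. Prefixing each form with time on a large file would be one way to measure the slow-down on your own system.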


 