While trying to optimize some of our server-related data processing, my team and I had a discussion about the usage of Linux commands. We would appreciate help understanding the concept more precisely.
On our servers we have log files that are created every minute, and we need to search them for specific tags, for example: error logs, timeout logs, request-failure logs. One of the requirements is to report the count of each of these tags.
The simple approach would be to use awk to extract the specific field (with a delimiter), then sort and uniq -c to count the number of instances.
I can see two ways to perform it:
cat fname | awk -F":" {'print $1'} | sort | uniq -c
and
awk -F":" {'print $1'} fname | sort | uniq -c
The files can be several GB in size, so which command would be more efficient?
There are 3 ways to open a file and have awk operate on its contents:
1. cat opens the file:
cat file | awk '...'
2. shell redirection opens the file:
awk '...' < file
3. awk opens the file:
awk '...' file
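Of those choices, cat and the pipe use resources and provide no value; google UUOC (Useless Use Of Cat) for details. Which of the other 2 to use is debatable: shell redirection gives consistent error handling of file open errors, while letting awk open the file populates FILENAME.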
To see the difference, consider these 2 files:
$ ls -l file1 file2
-rw-r--r-- 1 Ed None 4 Mar 30 09:55 file1
--w------- 1 Ed None 0 Mar 30 09:55 file2
$ cat file1
a
b
$ cat file2
cat: file2: Permission denied
and see what happens when you try to run awk on the contents of both using both methods of opening them:
$ awk '{print FILENAME, $0}' < file1
- a
- b
$ awk '{print FILENAME, $0}' file1
file1 a
file1 b
$ awk '{print FILENAME, $0}' < file2
-bash: file2: Permission denied
$ awk '{print FILENAME, $0}' file2
awk: fatal: cannot open file `file2' for reading (Permission denied)
Note that when you use redirection, the error message for opening the unreadable file, file2, came from the shell and so looked exactly like the error message when I first tried to cat it, while the error message when letting awk open it came from awk, is different from the shell message, and would differ across various awks.
Note that when using awk to open the file, FILENAME was populated with the name of the file being operated on, but when using redirection to open the file it was set to -.
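One case where a populated FILENAME is useful is when several log files are passed to a single awk invocation. The following is only a sketch under stated assumptions: the file names app1.log and app2.log and their contents are hypothetical, and the tag is assumed to be the first ':'-separated field.

```shell
# Hypothetical sample data: two small log files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'ERROR:a' 'TIMEOUT:b' 'ERROR:c' > "$dir/app1.log"
printf '%s\n' 'ERROR:d' 'REQUEST_FAIL:e'     > "$dir/app2.log"

# awk opens both files itself, so FILENAME says which file each line came from.
# The order of "for (k in cnt)" is unspecified, so the small result is sorted.
per_file=$(cd "$dir" && awk -F':' '
  { cnt[FILENAME " " $1]++ }
  END { for (k in cnt) print cnt[k], k }
' app1.log app2.log | LC_ALL=C sort -k2)

printf '%s\n' "$per_file"
rm -rf "$dir"
```

With redirection, the same per-file breakdown would need one awk invocation per file, since awk would not know where its input came from.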
I personally think that the benefit of "3" (populated FILENAME) vastly outweighs the benefit of "2" (consistent error handling of file open errors) and so I would always use:
awk '...' file
and for your particular problem you'd use:
awk -F':' '{cnt[$1]++} END{for (i in cnt) print cnt[i], i}' fname
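To check that the single-process awk version reports the same counts as the sort | uniq -c pipeline, here is a small sketch. The sample log lines are made up for illustration, and the final sort and the whitespace normalisation exist only so that the two outputs can be compared.

```shell
# Hypothetical sample log; the tag is the first ':'-separated field.
log=$(mktemp)
printf '%s\n' \
  'ERROR:disk full' \
  'TIMEOUT:upstream' \
  'ERROR:db down' \
  'REQUEST_FAIL:502' \
  'ERROR:oom' \
  'TIMEOUT:client' > "$log"

# Single awk process: one pass over the file, one array entry per distinct tag.
# "for (i in cnt)" has no defined order, so sort the (small) result by tag.
awk_counts=$(awk -F':' '{cnt[$1]++} END{for (i in cnt) print cnt[i], i}' "$log" |
  LC_ALL=C sort -k2)

# Pipeline from the question: every input line is sorted before counting.
# uniq -c pads the count with spaces, so normalise that for comparison.
pipe_counts=$(awk -F':' '{print $1}' "$log" | sort | uniq -c | awk '{print $1, $2}')

printf '%s\n' "$awk_counts"
rm -f "$log"
```

The awk version keeps only one counter per distinct tag in memory, whereas sort has to process every line of the input, which is where the difference is likely to show on multi-GB files.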
The useless cat should definitely be avoided by using:
awk -F":" '{print $1}' fname | sort | uniq -c
But my recommendation is to avoid even the expensive sort and uniq commands by finding unique items in awk itself:
awk -F":" '!seen[$1]++' fname
This prints the first line seen for each distinct value of the first field.
To get the number of distinct values:
awk -F":" '!count[$1]++{c++} END{print c}' fname
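The difference between the two commands above is easier to see on a tiny input. This is a sketch with made-up log lines, again assuming that the tag is the first ':'-separated field.

```shell
# Hypothetical sample log with two distinct tags across three lines.
log=$(mktemp)
printf '%s\n' 'ERROR:disk full' 'TIMEOUT:upstream' 'ERROR:db down' > "$log"

# Keeps the first full line seen for each distinct first field.
first_lines=$(awk -F":" '!seen[$1]++' "$log")

# Counts how many distinct first fields occur in the file.
distinct=$(awk -F":" '!count[$1]++{c++} END{print c}' "$log")

printf '%s\n' "$first_lines"
printf 'distinct tags: %s\n' "$distinct"
rm -f "$log"
```

Neither command reports how often each tag occurs; for per-tag counts an array of counters printed in an END block is still needed.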
cat fname | slows things down a little, as the file has to be copied from the disk to the kernel, then to cat's buffer, then to a pipe, which goes to the kernel again, and then to another process. It's not by a lot, as it should only be a linear slow-down and in-memory copying is quite fast, but you can always (i.e. without depending on some_command accepting file arguments) speed things up by replacing
cat one_file_name | some_command
with
<one_file_name some_command
which will be faster, as it directly sets one_file_name as the stdin of some_command.
<one_file_name can be, and often is, placed after some_command and before the next pipe symbol. I personally often like to start with it, as it mirrors the left-to-right flow of the useless but somewhat common use of cat (cat one_file_name).
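As a quick check that the position of the redirection does not change the result, the sketch below runs the same pipeline three ways on a made-up sample file. The file contents are hypothetical, and the example compares output only, not speed.

```shell
# Hypothetical sample input.
f=$(mktemp)
printf '%s\n' 'ERROR:x' 'TIMEOUT:y' 'ERROR:z' > "$f"

# 1. Useless cat: an extra process and an extra pipe.
with_cat=$(cat "$f" | awk -F':' '{print $1}' | sort | uniq -c)

# 2. Redirection written first: the shell opens the file as awk's stdin.
redir_first=$(< "$f" awk -F':' '{print $1}' | sort | uniq -c)

# 3. Redirection written after the command: same effect as 2.
redir_last=$(awk -F':' '{print $1}' < "$f" | sort | uniq -c)

printf '%s\n' "$redir_first"
rm -f "$f"
```

All three forms feed the same bytes to awk; only the first one involves cat and a second pipe.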