
Comparison of cat pipe awk operation to awk command on a file

While trying to optimize some of our server-related data handling, my team and I had a discussion about the usage of Linux commands. We would appreciate help understanding the concept more precisely.

On our servers we have log files that are created every minute, and we need to search them for logs with specific tags, for example: error logs, timeout logs, request-failure logs. One of many requirements is to report the count of each of these tags.

The simple approach is to use awk to extract the specific field (with a delimiter), then pipe it to sort and uniq -c to count the instances.

I can see two ways to do it:

cat fname | awk -F":" {'print $1'} | sort | uniq -c

and

awk -F":" {'print $1'} fname | sort | uniq -c

The file size can run to gigabytes, so which command would be more efficient?

There are 3 ways to open a file and have awk operate on its contents:

  1. cat opens the file:

     cat file | awk '...'
  2. shell redirection opens the file:

     awk '...' < file
  3. awk opens the file:

     awk '...' file

Of those choices:

  1. is always to be avoided, as the cat and the pipe use resources while providing no value; search for UUOC (Useless Use Of Cat) for details.

Which of the other 2 to use is debatable:

  2. has the advantage that the shell opens the file rather than the tool, so you can rely on consistent error handling if you do this for all tools.
  3. has the advantage that the tool knows the name of the file it is operating on (e.g. FILENAME in awk), so you can use that internally.

To see the difference, consider these 2 files:

$ ls -l file1 file2
-rw-r--r-- 1 Ed None 4 Mar 30 09:55 file1
--w------- 1 Ed None 0 Mar 30 09:55 file2
$ cat file1
a
b
$ cat file2
cat: file2: Permission denied

and see what happens when you run awk on the contents of each, using both methods of opening them:

$ awk '{print FILENAME, $0}' < file1
- a
- b

$ awk '{print FILENAME, $0}' file1
file1 a
file1 b

$ awk '{print FILENAME, $0}' < file2
-bash: file2: Permission denied

$ awk '{print FILENAME, $0}' file2
awk: fatal: cannot open file `file2' for reading (Permission denied)

Note that with redirection, the error message for the unreadable file, file2, came from the shell and so follows the same pattern as the message I got when I first tried to cat it. When awk opened the file itself, the error message came from awk; it differs from the shell message and would differ across various awks.

Note also that when awk opened the file, FILENAME was populated with the name of the file being operated on, but when redirection was used to open the file it was set to - .
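One case where a populated FILENAME is useful is when several files are passed to awk at once. The following is only a minimal sketch: the file names log1 and log2 and their contents are made up for illustration, and it simply labels each matching line with the file it came from.

```shell
# Two small hypothetical log files in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'Error:a' 'Timeout:b' > "$tmpdir/log1"
printf '%s\n' 'Error:c' 'Error:d'   > "$tmpdir/log2"

# awk opens both files itself, so FILENAME names the file each line came from.
tagged=$(cd "$tmpdir" && awk -F':' '$1 == "Error" {print FILENAME ": " $0}' log1 log2)
printf '%s\n' "$tagged"

rm -rf "$tmpdir"
```

With redirection this labelling is not available, since awk only sees an anonymous stdin.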

I personally think that the benefit of "3" (populated FILENAME) vastly outweighs the benefit of "2" (consistent handling of file-open errors), and so I would always use:

awk '...' file

and for your particular problem you'd use:

awk -F':' '{cnt[$1]++} END{for (i in cnt) print cnt[i], i}' fname
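As a quick check of that command, here is a small sketch run against a made-up log (the file name sample.log and the tags in it are assumptions for illustration). The order in which awk's for (i in cnt) visits keys is unspecified, so the short summary is sorted afterwards to make the output stable; that sort only sees one line per tag, not the whole file.

```shell
# Small hypothetical log: "tag:message" per line.
tmpdir=$(mktemp -d)
fname="$tmpdir/sample.log"
printf '%s\n' 'Error:disk full' 'Timeout:db' 'Error:no route' \
              'RequestFail:503' 'Error:oom' 'Timeout:cache' > "$fname"

# Count occurrences of field 1 in a single awk pass over the file,
# then sort the per-tag summary by tag name for a stable order.
counts=$(awk -F':' '{cnt[$1]++} END{for (i in cnt) print cnt[i], i}' "$fname" | LC_ALL=C sort -k2)
printf '%s\n' "$counts"

rm -rf "$tmpdir"
```

The trade-off is that awk keeps one array entry per distinct tag in memory, which is small when there are only a handful of tags.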

The useless cat should definitely be avoided, by using:

awk -F":" '{print $1}' fname | sort | uniq -c

But my recommendation is to avoid even the expensive sort and uniq commands by finding the unique items in awk itself:

awk -F":" '!seen[$1]++' fname

This prints the first line seen for each distinct value of the first field.

To get the number of distinct values:

awk -F":" '!count[$1]++{c++} END{print c}' fname
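To make the difference between these two commands concrete, here is a minimal sketch on a made-up four-line log (the file name and contents are assumptions). The first command keeps one line per distinct tag, in input order; the second reports how many distinct tags there are, not how often each one occurs.

```shell
# Small hypothetical log with two distinct tags.
tmpdir=$(mktemp -d)
fname="$tmpdir/sample.log"
printf '%s\n' 'Error:disk full' 'Timeout:db' 'Error:no route' 'Timeout:cache' > "$fname"

# First line seen for each distinct value of field 1.
first_lines=$(awk -F':' '!seen[$1]++' "$fname")

# Number of distinct values of field 1.
distinct=$(awk -F':' '!count[$1]++{c++} END{print c}' "$fname")

printf '%s\n' "$first_lines"
printf 'distinct tags: %s\n' "$distinct"

rm -rf "$tmpdir"
```

On an empty file the second command prints an empty line because c is never set; writing print c+0 in the END block would print 0 instead.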

cat fname | slows things down a little, because the data has to be copied from the disk into the kernel, then into cat's buffer, then into a pipe (which is in the kernel again), and then into another process. The cost is not large, as it should only be a linear slowdown and in-memory copying is quite fast, but you can always (that is, without depending on some_command accepting file arguments) speed things up by replacing

cat one_file_name | some_command 

with

<one_file_name some_command

which will be faster, as it directly sets one_file_name as the stdin of some_command.

<one_file_name can be, and often is, placed after some_command and before the next pipe symbol. I personally often like to start with it, as it mirrors the left-to-right flow of the useless but fairly common use of cat ( cat one_file_name ).
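The sketch below checks that the three ways of feeding the file to awk produce identical results, so switching between them is purely a performance and style choice. The generated file and its name (big.log) are assumptions for illustration; to compare speed on a real multi-gigabyte log, each pipeline could be prefixed with the shell's time keyword (in bash), with no particular result promised here.

```shell
# Generate a hypothetical test file: 10000 lines, 5 distinct tags.
tmpdir=$(mktemp -d)
fname="$tmpdir/big.log"
awk 'BEGIN{for (i = 0; i < 10000; i++) print "tag" (i % 5) ":payload " i}' > "$fname"

# 1. cat opens the file and pipes it to awk.
cat "$fname" | awk -F':' '{print $1}' | sort | uniq -c > "$tmpdir/cat.out"
# 2. The shell opens the file; the leading redirection sets awk's stdin.
<"$fname" awk -F':' '{print $1}' | sort | uniq -c > "$tmpdir/redir.out"
# 3. awk opens the file itself.
awk -F':' '{print $1}' "$fname" | sort | uniq -c > "$tmpdir/arg.out"

# Compare the three outputs byte for byte.
if cmp -s "$tmpdir/cat.out" "$tmpdir/redir.out" && cmp -s "$tmpdir/redir.out" "$tmpdir/arg.out"; then
    result=identical
else
    result=different
fi
echo "$result"

rm -rf "$tmpdir"
```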
