
How to determine which choice of one-line scripts is faster in BASH?

I have a script which is run a few million times in a single week, that simply finds the first cell in a CSV file called file.csv that matches $word exactly, and prints the whole line, e.g. CSV:

robot@mechanical@a machine that does automated work
fish@animal@an animal that lives in the sea
tree@plant@a plant that grows in the forest

If one searched for "tree", then this would be printed:

tree@plant@a plant that grows in the forest

These two approaches get the same results:

awk -F@ -v pattern="$word" '$1 ~ "^" pattern "$" {print; exit}' file.csv

grep ^$word@ file.csv | head -1
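As a quick sanity check that the two one-liners really agree, you can recreate the three-line file.csv from the example above in a scratch directory and run both (a sketch; the quoting around the grep pattern is added here for safety):

```shell
# Recreate the example file.csv in a temporary directory.
cd "$(mktemp -d)"
cat > file.csv <<'EOF'
robot@mechanical@a machine that does automated work
fish@animal@an animal that lives in the sea
tree@plant@a plant that grows in the forest
EOF

word=tree
# Both commands print the same matching line.
awk -F@ -v pattern="$word" '$1 ~ "^" pattern "$" {print; exit}' file.csv
grep "^$word@" file.csv | head -1
# Both print: tree@plant@a plant that grows in the forest
```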

Similarly, this can be used to check for an exact match in the second column of the CSV, assuming there are 3 columns:

awk -F@ -v pattern="$word" '$2 ~ "^" pattern "$" {print; exit}' file.csv

grep ^.*@$word@.*@.*$ file.csv | head -1

Given a choice of two scripts, such as the example above, which always produce exactly the same output, how can I quickly determine which will be faster?

You determine which is faster by measuring it. The time command is your first stop.

What should you time? How do you define "quickly"? This obviously depends, but if you expect most words to match, you could time how long the middlemost line in the file takes. Say you have 999 lines in the CSV file, and the 499th line uniquely contains "gollum":

time grep -m 1 '^gollum@' file.csv >/dev/null
time awk -F @ '$1 ~ "gollum" { print; exit }' file.csv >/dev/null

Are the line lengths not roughly uniform? Do you mainly expect searches to fail? Do most matches fall near the beginning of the file? Then adjust your experiment accordingly.

A common caveat is that disk I/O caching will make reruns quicker. To get comparable results, always perform a dummy run first to make sure the cache is populated for the real runs. Also rerun each experiment a few times so you can average out temporary variations in system load, etc.
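That warm-up-then-repeat advice can be packaged into a small helper. This is only a sketch, assuming bash and GNU date's %N nanosecond format; the `bench` function name and the iteration count are made up here:

```shell
# bench N CMD...: one warm-up run to populate the disk cache, then N timed runs.
bench() {
  local n=$1; shift
  "$@" > /dev/null 2>&1                 # warm-up pass (result discarded)
  local start end
  start=$(date +%s%N)                   # nanoseconds since epoch (GNU date)
  for ((i = 0; i < n; i++)); do
    "$@" > /dev/null 2>&1
  done
  end=$(date +%s%N)
  echo "$(( (end - start) / 1000000 )) ms for $n runs of: $*"
}

bench 100 grep -m 1 '^gollum@' file.csv
bench 100 awk -F @ '$1 ~ "gollum" { print; exit }' file.csv
```

Call each `bench` line a few times and compare the averages rather than trusting a single run.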

You can also reason about your problem. Other things being equal, I would expect grep to be faster, because it does less parsing both during startup and when processing each input line. But sometimes optimizations in one tool or the other (or a poorly chosen expression which ends up comparing apples to oranges, as in your last grep) throw off such common-sense expectations.

If you really care about efficiency, then avoid regex for an exact match and use both commands as:

awk -F'@' -v pattern="$word" '$1 == pattern{print; exit}' file.csv

grep -m1 -F "$word@" file.csv

To do some benchmarking, use the time command as:

time awk -F'@' -v pattern="$word" '$1 == pattern{print; exit}' file.csv

time grep -m1 -F "$word@" file.csv

Let them run on your file in a loop ~1 million times and print the time needed for both scripts (end - start). One will be faster than the other.
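A minimal sketch of that loop, assuming bash (for `time` before a compound command and `{1..1000}`); it uses 1000 iterations rather than a million so the example stays quick, and rebuilds the sample file.csv so the snippet stands alone:

```shell
# Build the sample file from the question in a scratch directory.
cd "$(mktemp -d)"
printf '%s\n' \
  'robot@mechanical@a machine that does automated work' \
  'fish@animal@an animal that lives in the sea' \
  'tree@plant@a plant that grows in the forest' > file.csv

word=tree
# Warm the cache once before timing.
grep -m1 -F "$word@" file.csv > /dev/null

time for i in {1..1000}; do
  awk -F'@' -v pattern="$word" '$1 == pattern{print; exit}' file.csv > /dev/null
done

time for i in {1..1000}; do
  grep -m1 -F "$word@" file.csv > /dev/null
done
```

Compare the two reported totals; dividing by the iteration count gives a per-invocation estimate.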
