How do I speed this up?

The following code makes a list of names and 'numbers' and gives each person a random age between 15 and 90.

#!/bin/sh

file=$1
n=$2

# if number is zero exit
if [ "$n" -eq "0" ]
then
    exit 0
fi

echo "Generating list of $n people."

for i in `seq 1 $n`;
do
    let "NUM=($RANDOM%75)+15"
    echo "name$i $NUM (###)###-####" >> $file
done

echo "List generated."

With it I'm attempting to make a list of 1M names. It's slow, which I expected, but it was so slow that I lost patience and tried 10K names. That was slow too, but it finished in a few seconds.

The reason I'm generating the names is to sort them. What surprised me is that when I sorted the list of 10K names it was instant.

How can this be?

Is there something that's making this go ungodly slow? Both the sorting and the generating are accessing files, so how can the sorting be faster? Is my random number math in the list generator what's slowing it down?

Here's my sorting script.

#!/bin/sh
#first argument is list to be sorted, second is output file
tr -s '' < $1 | sort -n -k2 > $2

Using the shell to generate random numbers like this isn't really what it was designed to do. You'll likely be better off coding something to generate random numbers from a uniform distribution in another language, like Fortran, Perl or C.
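As a side note on the "uniform distribution" point, here is a minimal sketch of what `($RANDOM%75)+15` can actually produce. It assumes `RANDOM` is uniform over 0..32767 (the range documented for Bash); the names `RANDOM_MAX` and `age_counts` are made up for this illustration.

```python
from collections import Counter

RANDOM_MAX = 32767  # assumption: Bash's RANDOM is uniform over 0..32767


def age_counts(modulus=75, offset=15):
    """Count how many RANDOM values map to each age under (RANDOM % modulus) + offset."""
    return Counter((r % modulus) + offset for r in range(RANDOM_MAX + 1))


counts = age_counts()
print(min(counts), max(counts))                    # youngest and oldest age produced
print(min(counts.values()), max(counts.values()))  # rarest and most common hit counts
```

Under that assumption the ages run from 15 to 89 (90 never appears), and because 32768 is not a multiple of 75 the ages 83 to 89 are each very slightly less likely than the rest. For a list of test names this hardly matters, but it is one reason a dedicated random library is the cleaner tool.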

In your code, one thing that's going to be very slow is generating a sequence of numbers from 1..1e6 and assigning them all to a variable. That's likely very wasteful, but you should profile if you want to be sure. As chaos points out, appending to the file is also likely to be very costly!

In Python, you can do something like this:

#!/usr/bin/env python3
import random

print('name age')
# Number the names 1..1000000, matching the shell loop
for count in range(1, 1000001):
    age = random.randrange(15, 90)  # 15..89, same range as ($RANDOM%75)+15
    print('name%d %d' % (count, age))

Running that on my laptop takes ~10 seconds. Assigning the seq from 1 to 1000000 takes ~10 seconds; when you add the random number generation, your script takes over three minutes on the same machine. I got frustrated just as you did, and played around with the script to try to make it faster. Here's the shortened version of your code that I'm playing with:

for x in `seq 1 10000`; do
   let "NUM=($RANDOM%75)+15"
   echo $NUM >> test.txt
done

Running this takes about 5.3s:

$ time ./test.sh
real    0m5.318s
user    0m1.305s
sys     0m0.675s

Removing the file appending and simply redirecting STDOUT to a single file gives the following script:

for x in `seq 1 10000`; do
   let "NUM=($RANDOM%75)+15"
   echo $NUM
done

Running this takes about half a second:

$ time ./test.sh > test.txt
real    0m0.516s
user    0m0.449s
sys     0m0.067s

The slowness of your program is at least partly due to appending to that file. Curiously, when I tried to swap the seq call with a for loop, I didn't notice any speedup.

for i in `seq 1 $n`

Yikes! This is generating 1,000,000 arguments to the for loop. That seq call will take a long, long, long time. Try

for ((i = 1; i <= n; i++))

Notice the lack of dollar signs, by the way. Peculiarly, the var++ syntax requires you to omit the dollar sign from the variable name. Elsewhere you may use or omit them: it could be i <= n or $i <= $n, either one. The way I figure, you should omit dollar signs entirely in let, declare, and for ((x; y; z)) statements. See the ARITHMETIC EVALUATION section of the bash man page for the full explanation.

Not a new answer, just new code.

This is, IMHO, a good middle way between nice and efficient code (as efficient as you can be in Bash; it IS slow, it's a shell...)

for ((i=1;i<=n;i++));
do
  echo "name$i $((NUM=(RANDOM%75)+15)) (###)###-####"
done > "$file"

An alternative that does not use a classic counter loop:

i=1
while ((i<=n)); do
  echo "name$((i++)) $((NUM=(RANDOM%75)+15)) (###)###-####"
done > "$file"

Both are about the same speed.

The fixes are the same as mentioned by all the others:

  • do not frequently close and re-open the file
  • use shell arithmetic
  • ah yes, and use QUOTES, but that's for sanity, not for speed

I guess the '>> $file' can be the source of your problem. On my system your script takes 10 seconds to generate 10000. If I remove the $file argument and instead just use stdout and capture the whole thing to a file, it takes under a second.

$ time ./gen1.sh n1.txt 10000
Generating list of 10000 people.
List generated.

real    0m7.552s
user    0m1.355s
sys     0m1.886s

$ time ./gen2.sh 10000 > n2.txt

real    0m0.806s
user    0m0.576s
sys     0m0.140s

Don't know if it's the whole story, but re-opening the file to append to it for every name can't be helping anything. Doing the whole thing in any context where you can keep an open file handle to write to should help a lot.
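To make the open-handle point concrete outside the shell, here is a rough Python sketch that writes the same lines two ways: re-opening the file in append mode for every line (roughly what `echo ... >> $file` does inside a loop) and opening it once. The function names and the 10000-line count are made up for this illustration, and the absolute timings will differ from the shell's, so treat it as a way to see the effect rather than to measure the original script.

```python
import os
import tempfile
import time


def write_reopening(path, n):
    """Open and close the file once per line, like `echo ... >> $file` in a loop."""
    for i in range(1, n + 1):
        with open(path, "a") as f:
            f.write("name%d\n" % i)


def write_once(path, n):
    """Open the file a single time and reuse the handle for every line."""
    with open(path, "w") as f:
        for i in range(1, n + 1):
            f.write("name%d\n" % i)


def timed(func, n):
    """Run func against a fresh temporary file; return (seconds, lines written)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        start = time.perf_counter()
        func(path, n)
        elapsed = time.perf_counter() - start
        with open(path) as f:
            lines = sum(1 for _ in f)
        return elapsed, lines
    finally:
        os.remove(path)


if __name__ == "__main__":
    for func in (write_reopening, write_once):
        seconds, lines = timed(func, 10000)
        print("%-16s %d lines in %.3fs" % (func.__name__, lines, seconds))
```

Both functions produce identical files; only the number of open/close calls differs, which is the same difference as between `>> $file` inside the loop and `done > $file` after it.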

Try this for your main loop:

seq 1 $n | while read i
do
    let "NUM=($RANDOM%75)+15"
    echo "name$i $NUM (###)###-####"
done > $file

This will make the seq and the loop work in parallel instead of waiting for the seq to finish before starting the loop. This will be faster on multiple cores/CPUs but slightly slower on a single core.

And I agree with the others here: Does it have to be bash?

Edit: added chaos' suggestion to keep the file open, rather than opening it for append for each name.

(I have a feeling you may not like this answer, but you technically didn't specify the answer had to remain in bash! :P)

It's common to rapidly develop something in a prototyping language, and then possibly switch to another language (often C) as needed. Here's a very similar program in Python for you to compare:

#!/usr/bin/python
import sys
import random

def main(args=None):
    args = args or []
    if len(args) == 1:
        # default first parameter
        args = ["-"] + args
    if len(args) != 2:
        sys.stderr.write("error: invalid parameters\n")
        return 1
    n = int(args[1])
    output = sys.stdout if args[0] == "-" else open(args[0], "a")

    for i in range(1, n + 1):
        num = random.randint(15, 89)  # 15..89, matching ($RANDOM%75)+15
        output.write("name%s %s (###)###-####\n" % (i, num))

    sys.stderr.write("List generated.\n") # see note below

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))

Note: Using stdout only for "real output", and not for status notifications, allows this program to be run in parallel with others, piping data directly from the stdout of one to the stdin of another. (It's possible with special files in *nix, but just easier if you can use stdout.) Example:

$ ./rand_names.py 1000000 | sort -n -k2 > output_file
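The reason that pipeline works is that the data lines and the status messages travel on different streams. A small sketch of that separation, with the streams passed in explicitly so it can be checked in memory, might look like this (the `generate` function and its `out`/`status` parameters are made up for this illustration and are not part of the script above):

```python
import io
import random
import sys


def generate(n, out=sys.stdout, status=sys.stderr):
    """Write n records to `out`; progress messages go to `status` only."""
    status.write("Generating list of %d people.\n" % n)
    for i in range(1, n + 1):
        age = random.randint(15, 89)  # same 15..89 range as ($RANDOM%75)+15
        out.write("name%d %d (###)###-####\n" % (i, age))
    status.write("List generated.\n")


if __name__ == "__main__":
    # Capture both streams in memory to confirm nothing leaks into the data.
    data, messages = io.StringIO(), io.StringIO()
    generate(3, out=data, status=messages)
    print(data.getvalue(), end="")
    print("status lines:", len(messages.getvalue().splitlines()))
```

With the defaults, `generate(1000000)` behaves like the script: records go to stdout, where `sort` can read them, while the two status lines go to stderr and stay on the terminal.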

And it should be fast enough:

$ time ./rand_names.py 1000000 > /dev/null
List generated.

real    0m16.393s
user    0m15.108s
sys     0m0.171s
