
bash 'while read line' efficiency with big file

I was using a while loop to process a task that reads records from a big file of about 10 million lines. I found that the processing becomes slower and slower as time goes by.

I made a simulated script with 1 million lines, shown below, which reproduces the problem. But I still don't know why. How does the read command work?

seq 1000000 > seq.dat
while read s;
do
    if [ `expr $s % 50000` -eq 0 ];then
        echo -n $( expr `date +%s` - $A) ' ';
        A=`date +%s`;
    fi
done < seq.dat

The terminal outputs the time intervals:

98 98 98 98 98 97 98 97 98 101 106 112 121 121 127 132 135 134

From about the tenth interval onward (around line 500,000), the processing becomes noticeably slower.

Using your code, I saw the same pattern of increasing times (right from the beginning!). If you want faster processing, you should rewrite it using the shell's built-in features. Here's my bash version:

tabChar=$'\t'  # a literal tab character
seq 1000000 > seq.dat
while read s;
do
    if (( ! ( s % 50000 ) )) ;then
        echo $s "${tabChar}" $( expr `date +%s` - $A) 
        A=$(date +%s);
    fi
done < seq.dat

Edit: fixed a bug. The output indicated that each line was being processed; now only every 50,000th line gets the timing treatment. Doh!

was

  if ((  s % 50000 )) ;then

fixed to

  if (( ! ( s % 50000 ) )) ;then

Output now with ksh (echo ${.sh.version} = Version JM 93t+ 2010-05-24):

50000
100000   1
150000   0
200000   1
250000   0
300000   1
350000   0
400000   1
450000   0
500000   1
550000   0
600000   1
650000   0
700000   1
750000   0
800000   1
850000   0
900000   1
950000   0
1e+06    1

Output with bash:

50000    480
100000   3
150000   2
200000   3
250000   3
300000   2
350000   3
400000   3
450000   2
500000   2
550000   3
600000   2
650000   2
700000   3
750000   3
800000   2
850000   2
900000   3
950000   2

As to why your original test case is taking so long ... not sure. I was surprised to see both the time for each test cycle and the increase in time. If you really need to understand this, you may need to spend time instrumenting more tests. Maybe you'd see something by running truss or strace (depending on your base OS).
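As a rough sketch of the strace idea, the snippet below runs a scaled-down version of the question's loop under strace and writes a per-syscall summary. The function name trace_loop, the output file strace_summary.txt, and the reduced line count are assumptions made for illustration. The -f flag follows the child processes (expr), and -c prints a count summary instead of every call. On systems without strace (where truss is the equivalent) it only prints a hint.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the system calls made by the question's loop
# on a small input. Not part of the original answer.
trace_loop() {
    local lines=$1 data
    data=$(mktemp)
    seq "$lines" > "$data"
    if command -v strace >/dev/null 2>&1; then
        # -f: follow forked children (expr); -c: summarize counts per syscall;
        # -o: write the summary to a file instead of stderr
        if strace -f -c -o strace_summary.txt bash -c '
            while read s; do
                if [ `expr $s % 500` -eq 0 ]; then :; fi
            done < "$1"' _ "$data"; then
            echo "summary written to strace_summary.txt"
        else
            echo "strace could not trace the loop on this system"
        fi
    else
        echo "strace not found; try truss if your OS provides it"
    fi
    rm -f "$data"
}

trace_loop 2000
```

In the summary, a large number of process-creation and exec calls relative to the number of input lines would point at the external commands rather than at read itself.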

I hope this helps.

read is a comparatively slow process, as the author of "Learning the Korn Shell" points out*. (Just above Section 7.2.2.1.) There are other programs, such as awk or sed, that have been highly optimized to do what is essentially the same thing: read from a file one line at a time and perform some operations using that input.

Not to mention that you're calling an external process every time you do subtraction or take the modulus, which can get expensive. awk has both of those functionalities built in.
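If you want to stay in bash, the same point can be illustrated with a minimal sketch that starts no external process inside the loop. It assumes bash; report_intervals and seq_small.dat are hypothetical names chosen for this example. The modulus and the subtraction use built-in arithmetic, and the built-in SECONDS variable (seconds since the shell started) stands in for date +%s.

```shell
#!/usr/bin/env bash
# Sketch: the timing loop from the question with no forks inside the loop.
# report_intervals is a hypothetical name, not from the original post.
report_intervals() {
    local file=$1 step=$2 s last=$SECONDS
    while read -r s; do
        # built-in arithmetic replaces `expr $s % 50000`
        if (( s % step == 0 )); then
            # SECONDS replaces `date +%s`; the subtraction is built in too
            printf '%s\t%s\n' "$s" "$(( SECONDS - last ))"
            last=$SECONDS
        fi
    done < "$file"
}

seq 200000 > seq_small.dat
report_intervals seq_small.dat 50000
rm -f seq_small.dat
```

Each output line holds the line number and the seconds elapsed since the previous report, separated by a tab.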

As the following test points out, awk is quite a bit faster:

#!/usr/bin/env bash

seq 1000000 | 
awk '
  BEGIN {
    command = "date +%s"
    prevTime = 0
  }
  $1 % 50000 == 0 {
    command | getline currentTime
    close(command)

    print currentTime - prevTime
    prevTime = currentTime
  }
'

Output:

1335629268
0   
0   
0   
0   
0   
0   
0   
0   
0   
0   
0   
0   
0   
0   
1   
0   
0   
0   
0

Note that the first number is simply the output of date +%s, because prevTime starts at 0. Just like in your test case, I left the first match as is.

Note

*Yes, the author is talking about the Korn Shell, not bash as the OP tagged, but bash and ksh are rather similar in a lot of ways; bash borrowed many of its features from ksh. So I would assume that the read command is not drastically different from one shell to another.
