Why is this grep filter slow?

Question

I want to get the first two letters of every word in the BSD dict word list, excluding words that are only one letter long.

Without the one-letter exclusion it runs extremely fast:

time cat /usr/share/dict/web2 | cut -c 1-2 | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null

real    0m0.227s
user    0m0.375s
sys 0m0.021s

Filtering with grep '..', however, is painfully slow:

time cat /usr/share/dict/web2 | cut -c 1-2 | grep '..' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null

real    1m16.319s
user    1m0.694s
sys 0m10.225s

What's going on here?

Answer 1

The problem is the UTF-8 locale; there is an easy workaround for a 100x speedup

What's really slow on the Mac is the UTF-8 locale.

Replace grep '..' with LC_ALL=C grep '..' and your command will run over 100x faster.

This is probably true on Linux as well, although a given Linux distro is probably more likely to default to the C locale.
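
A minimal sketch of the workaround applied to the pipeline from the question, assuming a POSIX shell. The function name prefix_counts is hypothetical, and the word-list path in the usage comment is the one from the question, which may not exist on every system. Setting LC_ALL=C on the grep stage alone makes it match bytes instead of decoding multibyte characters.

```shell
#!/bin/sh
# prefix_counts is a hypothetical helper name, not part of any tool.
# It counts the two-letter prefixes of a sorted word list, skipping
# one-letter words, using the same stages as the question's pipeline.
prefix_counts() {
    # LC_ALL=C applies only to grep: it matches bytes, not UTF-8 characters.
    cut -c 1-2 "$1" | LC_ALL=C grep '..' | tr '[a-z]' '[A-Z]' | uniq -c
}

# Usage (path taken from the question; adjust for your system):
# time prefix_counts /usr/share/dict/web2 > /dev/null
```

Scoping the variable to a single command, rather than exporting it, leaves the locale of the rest of the shell session unchanged.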

Answer 2

I don't know why it is so awful. But one quick way to speed it up is to invert your grep(1) expression with -v and throw away all one-character lines:

$ time cat /usr/share/dict/words | cut -c 1-2 | grep -v '^.$' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null

real    0m0.086s
user    0m0.090s
sys  0m0.000s
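
The two filters can be compared on a tiny input. This is only an illustrative sketch with made-up sample words. Both filters drop one-character lines, but they differ on empty lines: grep '..' drops them while grep -v '^.$' keeps them, which does not matter for a word list that has no blank lines.

```shell
# Made-up sample input: a one-letter word followed by two longer words.
sample=$(printf 'a\nab\nabc')

# Positive match: keep lines that contain at least two characters.
kept_match=$(printf '%s\n' "$sample" | grep '..')

# Inverted match: drop lines that are exactly one character long.
kept_invert=$(printf '%s\n' "$sample" | grep -v '^.$')

# Both variables hold the same two lines for this input.
printf '%s\n' "$kept_match"
printf '%s\n' "$kept_invert"
```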

Answer 3

This might run better, and it also gets rid of cut, so you need one pipe fewer.

cat /usr/share/dict/web2 | grep -Eo '^.{2}' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null

Answer 4

It might be even faster if you cut down on the excessive pipes and the useless use of cat:

$ awk 'length($0) > 1 { a[toupper(substr($0,1,2))]++ } END { for (i in a) print i, a[i] }' file

4 answers

solution1 9 ACCEPTED 2011-03-22 22:35:17

solution2 2 2011-03-22 22:22:54

solution3 1 2011-03-22 22:37:40

solution4 1 2011-03-22 23:43:05
