Why is Python3 much slower than Python2 on my task?
I was surprised to find that Python 3.5.2 is much slower than Python 2.7.12. I wrote a simple command-line one-liner that counts the lines in a huge CSV file.
$ cat huge.csv | python -c "import sys; print(sum(1 for _ in sys.stdin))"
101253515
# it took 15 seconds
$ cat huge.csv | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"
101253515
# it took 66 seconds
Python 2.7.12 took 15 seconds, Python 3.5.2 took 66 seconds. I expected some difference, but why is it so huge? What's new in Python 3 that makes it so much slower for this kind of task? Is there a faster way to count the lines in Python 3?
My CPU is Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz.
The size of huge.csv is 18.1 Gb and it contains 101253515 lines.
In asking this question, I don't need to find the number of lines in a big file at any cost. I just picked one particular case where Python 3 is much slower. Actually, I am developing a script in Python 3 that deals with big CSV files, and some of the operations don't involve the csv library. I know I could write the script in Python 2, and the speed would be acceptable. But I would like to know how to write a similar script in Python 3. This is why I am interested in what makes Python 3 slower in my example and how it can be improved with "honest" Python approaches.
The sys.stdin object is a bit more complicated in Python3 than it was in Python2. For example, by default, reading from sys.stdin in Python3 decodes the input to unicode, so it fails on bytes that are not valid UTF-8:
$ echo -e "\xf8" | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <genexpr>
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte
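The same failure can be reproduced without a shell pipe. The sketch below is only an illustration, not part of the original measurements: it wraps an in-memory byte stream in io.TextIOWrapper (the kind of text layer Python3 puts on top of stdin) and compares it with iterating over the raw bytes. The helper names count_text_lines and count_binary_lines are made up for this example, and the encoding is assumed to be UTF-8, as in the traceback.

```python
import io

raw = b"\xf8\n"  # the same bytes that `echo -e "\xf8"` writes to the pipe


def count_text_lines(data):
    # Text layer: decodes bytes to str while reading (strict UTF-8 here)
    stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")
    return sum(1 for _ in stream)


def count_binary_lines(data):
    # Raw byte stream: lines come back as bytes, no decoding step
    return sum(1 for _ in io.BytesIO(data))


try:
    count_text_lines(raw)
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("text layer failed:", exc.reason)

print("binary layer counted:", count_binary_lines(raw))
```

The text layer has to decode every line before handing it over, which is extra work the byte-level iteration never does.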
Note that Python2 doesn't have any problem with that input. So, as you can see, Python3's sys.stdin does more work under the hood. I'm not sure if this is exactly what is responsible for the performance loss, but you can investigate it further by trying sys.stdin.buffer under Python3:
import sys
print(sum(1 for _ in sys.stdin.buffer))
Note that .buffer doesn't exist in Python2. I've done some tests and I don't see any real difference in performance between Python2's sys.stdin and Python3's sys.stdin.buffer, but YMMV.
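If only the line count is needed, one further option (not something I timed above, so treat it as an untested idea as far as speed goes) is to skip per-line iteration and count newline bytes in large binary chunks. This is a minimal sketch; count_newlines and the 1 MiB chunk size are arbitrary choices for the example.

```python
import io


def count_newlines(stream, chunk_size=1 << 20):
    # Read fixed-size chunks from a binary stream and count newline bytes
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes object means end of input
            break
        total += chunk.count(b"\n")
    return total


# Small in-memory stand-in for a CSV file
sample = io.BytesIO(b"a,b\nc,d\ne,f\n")
print(count_newlines(sample))
```

In a script fed through a pipe, the call would be count_newlines(sys.stdin.buffer). This counts newline bytes, so a last line without a trailing newline is not counted, unlike the sum(1 for _ in ...) form.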
EDIT: Here are some random results on my machine: Ubuntu 16.04, i7 CPU, 8GiB RAM. First, some C code (as a baseline for comparison):
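#include <unistd.h>

int main(void) {
    char buffer[4096];
    size_t total = 0;
    for (;;) {
        /* read() returns 0 at end of input and -1 on error */
        ssize_t result = read(STDIN_FILENO, buffer, sizeof(buffer));
        if (result <= 0) {
            break;
        }
        total += (size_t)result;
    }
    (void)total; /* the byte total is never printed */
    return 0;
}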
Now the file size:
$ ls -s --block-size=M | grep huge2.txt
10898M huge2.txt
And the tests:
// a.out is the simple C equivalent (except for the final print)
$ time cat huge2.txt | ./a.out
real 0m20.607s
user 0m0.236s
sys 0m10.600s
$ time cat huge2.txt | python -c "import sys; print(sum(1 for _ in sys.stdin))"
898773889
real 1m24.268s
user 1m20.216s
sys 0m8.724s
$ time cat huge2.txt | python3 -c "import sys; print(sum(1 for _ in sys.stdin.buffer))"
898773889
real 1m19.734s
user 1m14.432s
sys 0m11.940s
$ time cat huge2.txt | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"
898773889
real 2m0.326s
user 1m56.148s
sys 0m9.876s
So the file I used was a bit smaller and the times were longer (it seems that you have a better machine, and I didn't have the patience for larger files :D). Anyway, Python2's sys.stdin and Python3's sys.stdin.buffer are quite similar in my tests. Python3's sys.stdin is way slower. And all of them are way behind the C code (which has almost 0 user time).
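To compare the text and binary layers on another machine without an 18 Gb file, something like the following can be used. It is a rough sketch under stated assumptions: the data is synthetic and lives in memory, so pipe and disk costs are excluded, and timed_count is a helper invented for this example. Numbers will vary between machines and Python versions.

```python
import io
import time


def timed_count(make_stream):
    # Return (line_count, elapsed_seconds) for one pass over a fresh stream
    stream = make_stream()
    start = time.perf_counter()
    count = sum(1 for _ in stream)
    return count, time.perf_counter() - start


# Synthetic CSV-like data: 200000 short lines (contents are made up)
data = b"1,2,3,some text\n" * 200000

binary_count, binary_secs = timed_count(lambda: io.BytesIO(data))
text_count, text_secs = timed_count(
    lambda: io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")
)

print("binary: %d lines in %.3fs" % (binary_count, binary_secs))
print("text:   %d lines in %.3fs" % (text_count, text_secs))
```

Both passes should report the same line count; only the elapsed times are expected to differ.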