Why is Python3 much slower than Python2 on my task?

I was surprised to find that Python 3.5.2 is much slower than Python 2.7.12. I wrote a simple command-line one-liner that counts the lines in a huge CSV file.

$ cat huge.csv | python -c "import sys; print(sum(1 for _ in sys.stdin))"
101253515
# it took 15 seconds

$ cat huge.csv | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"
101253515
# it took 66 seconds

Python 2.7.12 took 15 seconds; Python 3.5.2 took 66 seconds. I expected some difference, but why is it so large? What changed in Python 3 that makes it so much slower for this kind of task? Is there a faster way to count lines in Python 3?

My CPU is an Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz.

huge.csv is 18.1 GB in size and contains 101253515 lines.

To be clear, I don't need to count the lines of a big file at any cost; this is just one particular case where Python 3 is much slower. I am actually developing a Python 3 script that processes big CSV files, and some of its operations don't call for the csv library. I know I could write the script in Python 2 and the speed would be acceptable, but I would like to know how to write a similar script in Python 3. That is why I want to know what makes Python 3 slower in my example and how it can be improved with "honest" Python approaches.

The sys.stdin object is a bit more complicated in Python3 than it was in Python2. For example, by default reading from sys.stdin in Python3 decodes the input to unicode, so it fails on bytes that are not valid UTF-8:

$ echo -e "\xf8" | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <genexpr>
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

Note that Python2 doesn't have any problem with that input. So, as you can see, Python3's sys.stdin does more work under the hood. I'm not sure whether this is exactly what causes the performance loss, but you can investigate further by trying sys.stdin.buffer under Python3:

import sys
print(sum(1 for _ in sys.stdin.buffer))

Note that .buffer doesn't exist in Python2. I've done some tests and I don't see a real difference in performance between Python2's sys.stdin and Python3's sys.stdin.buffer, but YMMV.
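The difference between the two layers can be reproduced without a shell pipe. The following is a minimal sketch, assuming Python 3: it uses io.BytesIO as a stand-in for the piped input and io.TextIOWrapper as a stand-in for the text layer that sys.stdin puts on top of sys.stdin.buffer. The sample bytes and variable names are made up for illustration.

```python
import io

# Stand-in for piped input: two lines, the first holding a byte
# (0xf8) that is not valid UTF-8.
raw = b"\xf8\nsecond line\n"

# Binary layer: iterating yields bytes objects, nothing is decoded.
binary_count = sum(1 for _ in io.BytesIO(raw))

# Text layer: the same bytes behind a UTF-8 decoder, similar to how
# sys.stdin sits on top of sys.stdin.buffer.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
try:
    text_count = sum(1 for _ in text_stream)
except UnicodeDecodeError:
    text_count = None  # decoding failed, as in the traceback above

print(binary_count, text_count)
```

The binary iteration counts both lines, while the text iteration stops with UnicodeDecodeError, which illustrates the extra decoding step that the text layer performs on every chunk it reads.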

EDIT: Here are some results from my machine (Ubuntu 16.04, i7 CPU, 8GiB RAM). First, some C code as a baseline for comparison:

#include <unistd.h>

int main(void) {
    char buffer[4096];
    size_t total = 0;
    for (;;) {
        /* read() returns 0 at end of input and -1 on error */
        ssize_t result = read(STDIN_FILENO, buffer, sizeof(buffer));
        if (result <= 0) {
            break;
        }
        total += (size_t)result;
    }
    return 0;
}

Now the file size:

$ ls -s --block-size=M | grep huge2.txt 
10898M huge2.txt

And the tests:

# a.out is the compiled C code above (equivalent except for the final print)
$ time cat huge2.txt | ./a.out

real    0m20.607s
user    0m0.236s
sys     0m10.600s


$ time cat huge2.txt | python -c "import sys; print(sum(1 for _ in sys.stdin))"
898773889

real    1m24.268s
user    1m20.216s
sys     0m8.724s


$ time cat huge2.txt | python3 -c "import sys; print(sum(1 for _ in sys.stdin.buffer))"
898773889

real    1m19.734s
user    1m14.432s
sys     0m11.940s


$ time cat huge2.txt | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"
898773889

real    2m0.326s
user    1m56.148s
sys     0m9.876s

So the file I used was a bit smaller and the times were longer (it seems you have a better machine, and I didn't have the patience for larger files :D). Anyway, Python2's sys.stdin and Python3's sys.stdin.buffer perform quite similarly in my tests. Python3's sys.stdin is way slower. And all of them are waaaay behind the C code (which has almost 0 user time).
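Since the C baseline reads fixed-size blocks rather than lines, one thing that might be worth trying in Python 3 is to read the binary stream in large chunks and count newline bytes. This is only a sketch and was not part of the measurements above; count_newlines and its chunk_size parameter are hypothetical names, and the sample data is made up. Note that it counts b"\n" bytes, so a final line without a trailing newline is not counted, unlike iterating over lines.

```python
import io


def count_newlines(stream, chunk_size=1 << 20):
    """Count b"\\n" bytes in a binary stream, reading fixed-size chunks."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes object means end of input
            break
        total += chunk.count(b"\n")
    return total


# In-memory demo; a tiny chunk size forces several read() calls.
sample = io.BytesIO(b"a,b\nc,d\ne,f\n")
demo_count = count_newlines(sample, chunk_size=4)
print(demo_count)
```

For piped input, passing sys.stdin.buffer as the stream (for example count_newlines(sys.stdin.buffer) in a script fed by cat) would be the natural usage; whether it actually beats line iteration should be timed on the real file.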

