How to reduce the execution time difference between the C API and the Python executable?

Question

Running the same Python script with python3 or through an embedded interpreter that uses libpython3 gives different execution times.

$ time PYTHONPATH=. ./simple
real    0m6,201s
user    1m3,680s
sys     0m0,212s

$ time PYTHONPATH=. python3 -c 'import test; test.run()'
real    0m5,193s
user    0m53,349s
sys     0m0,164s

(Deleting the contents of __pycache__ between runs does not seem to have any effect.)

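Currently, calling python3 with the script is faster; in my actual use case it is faster by a factor of 1.5 compared to the same script run from within the embedded interpreter.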

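I would like to (1) understand where the difference comes from and (2) find out whether the same performance can be reached with the embedded interpreter. (Using e.g. Cython is not an option at the moment.)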
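To narrow down where the difference comes from, it can help to take the measurement inside the interpreter rather than with the shell's time, so that startup and shutdown are excluded. The following is only a sketch; the helper name measure is made up for illustration and is not part of the original post. It relies on os.times(), which reports the CPU time of the process itself and of child processes that have already been waited for.

```python
import os
import time


def measure(fn, *args, **kwargs):
    """Run fn once and return wall-clock and CPU time deltas in seconds.

    Hypothetical helper for illustration; not from the original post.
    """
    t0, wall0 = os.times(), time.perf_counter()
    fn(*args, **kwargs)
    t1, wall1 = os.times(), time.perf_counter()
    return {
        "wall": wall1 - wall0,
        "user": t1.user - t0.user,
        "system": t1.system - t0.system,
        # CPU time of child processes that have been waited for,
        # e.g. worker processes once they have been joined.
        "children_user": t1.children_user - t0.children_user,
        "children_system": t1.children_system - t0.children_system,
    }
```

Calling measure(test.run) from both python3 and the embedded interpreter would give numbers that can be compared directly, assuming both see the same test module on PYTHONPATH.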

Code

simple.cpp

Compilation:

 g++ -std=c++11 -fPIC $(python3-config --cflags) simple.cpp \
 $(python3-config --ldflags) -o simple

test.py

import sys
sys.stdout = open('output.bin', 'bw')
import mandel

def run():
    mandel.mandelbrot(4096)

mandel.py

An adapted version of the Mandelbrot program from the benchmarks game (see license).

from contextlib import closing
from itertools import islice
from os import cpu_count
from sys import stdout

def pixels(y, n, abs):
    range7 = bytearray(range(7))
    pixel_bits = bytearray(128 >> pos for pos in range(8))
    c1 = 2. / float(n)
    c0 = -1.5 + 1j * y * c1 - 1j
    x = 0
    while True:
        pixel = 0
        c = x * c1 + c0
        for pixel_bit in pixel_bits:
            z = c
            for _ in range7:
                for _ in range7:
                    z = z * z + c
                if abs(z) >= 2.:
                    break
            else:
                pixel += pixel_bit
            c += c1
        yield pixel
        x += 8

def compute_row(p):
    y, n = p
    result = bytearray(islice(pixels(y, n, abs), (n + 7) // 8))
    result[-1] &= 0xff << (8 - n % 8)
    return y, result

def ordered_rows(rows, n):
    order = [None] * n
    i = 0
    j = n
    while i < len(order):
        if j > 0:
            row = next(rows)
            order[row[0]] = row
            j -= 1
        if order[i]:
            yield order[i]
            order[i] = None
            i += 1

def compute_rows(n, f):
    row_jobs = ((y, n) for y in range(n))
    if cpu_count() < 2:
        yield from map(f, row_jobs)
    else:
        from multiprocessing import Pool
        with Pool() as pool:
            unordered_rows = pool.imap_unordered(f, row_jobs)
            yield from ordered_rows(unordered_rows, n)

def mandelbrot(n):
    write = stdout.write
    with closing(compute_rows(n, compute_row)) as rows:
        write("P4\n{0} {0}\n".format(n).encode())
        for row in rows:
            write(row[1])

Answer 1

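So apparently the time difference comes from linking libpython statically versus dynamically. In a Makefile sitting next to python.c (from the reference implementation), the following builds a statically linked version of the interpreter: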

snake: python.c
    g++ \
    -I/usr/include/python3.6m \
    -pthread \
    -specs=/usr/share/dpkg/no-pie-link.specs \
    -specs=/usr/share/dpkg/no-pie-compile.specs \
    \
    -Wall \
    -Wformat \
    -Werror=format-security \
    -Wno-unused-result \
    -Wsign-compare \
    -DNDEBUG \
    -g \
    -fwrapv \
    -fstack-protector \
    -O3 \
    \
    -Xlinker -export-dynamic \
    -Wl,-Bsymbolic-functions \
    -Wl,-z,relro \
    -Wl,-O1 \
    python.c \
    /usr/lib/python3.6/config-3.6m-x86_64-linux-gnu/libpython3.6m.a \
    -lexpat \
    -lpthread \
    -ldl \
    -lutil \
    -lexpat \
    -L/usr/lib \
    -lz \
    -lm \
    -o $@

Replacing /usr/lib/.../libpython3.6m.a with -lpython3.6m builds the version that ends up being slower (this also requires adding -L/usr/lib/python3.6/config-3.6m-x86_64-linux-gnu).
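To see which of the two builds a given process is actually running, one option on Linux is to look for a libpython shared object among the files mapped into the process. This is a rough sketch under that assumption: the function names are invented here, and the /proc/self/maps approach is Linux-specific. sysconfig's Py_ENABLE_SHARED is printed as well, but it describes how the Python installation was configured, which is not necessarily how an embedding binary was linked.

```python
import os
import sysconfig


def mapped_libpython(maps_path="/proc/self/maps"):
    """Return sorted paths of libpython shared objects mapped into this process.

    Hypothetical helper; returns an empty list when maps_path does not exist
    (for example on non-Linux systems).
    """
    if not os.path.exists(maps_path):
        return []
    found = set()
    with open(maps_path) as maps:
        for line in maps:
            # Fields: address perms offset dev inode [pathname]
            fields = line.split(None, 5)
            if len(fields) == 6:
                path = fields[5].strip()
                if os.path.basename(path).startswith("libpython"):
                    found.add(path)
    return sorted(found)


def describe_linkage():
    """Summarise the hints available about static vs. dynamic linking."""
    libs = mapped_libpython()
    return {
        # Build-time setting of the installation (may be None on some platforms).
        "Py_ENABLE_SHARED": sysconfig.get_config_var("Py_ENABLE_SHARED"),
        "mapped_libpython": libs,
        # A mapped libpython*.so suggests the interpreter was loaded dynamically.
        "looks_dynamic": bool(libs),
    }


if __name__ == "__main__":
    for key, value in describe_linkage().items():
        print(key, value)
```

If mapped_libpython is empty, the interpreter code is most likely part of the executable itself, as in the statically linked snake target above.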

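Postscript

The speed difference exists, but it is not the full answer to my original question; in practice, the "slower" interpreter was executed under a specific LD_PRELOAD environment that changed how the system time functions behaved and interfered with cProfile.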
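A quick sanity check for this kind of situation is to print what LD_PRELOAD contains inside the process being measured and to time the same loop with two different clocks. The sketch below assumes nothing about the specific preloaded library from the post; the helper names are made up, and a large disagreement between the two clocks is only a hint, not proof, that the timing functions are being intercepted.

```python
import os
import time


def preload_entries(environ=os.environ):
    """Return the libraries listed in LD_PRELOAD (empty list if unset)."""
    raw = environ.get("LD_PRELOAD", "")
    # The dynamic linker accepts both colons and spaces as separators.
    return [entry for entry in raw.replace(":", " ").split() if entry]


def clock_check(iterations=200000):
    """Time one busy loop with a wall clock and with a CPU-time clock."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    total = 0
    for i in range(iterations):
        total += i * i
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return wall, cpu


if __name__ == "__main__":
    print("LD_PRELOAD entries:", preload_entries())
    print("wall seconds, cpu seconds:", clock_check())
```

Running this in both the embedded interpreter and plain python3 shows whether the two environments differ before any profiling is done.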

Solution 1
2, accepted, 2019-09-06 14:43:11
