
Why is Python faster than C when concatenating two strings?

I want to compare the speed of Python and C when they're used for string handling. I expected C to give better performance than Python; however, I got the opposite result.

Here's the C program:

#include <unistd.h>
#include <sys/time.h>

#define L (100*1024)

char s[L+1024];
char c[2*L+1024];

double time_diff( struct timeval et, struct timeval st )
{
    return 1e-6*((et.tv_sec - st.tv_sec)*1000000 + (et.tv_usec - st.tv_usec ));
}

int foo()
{
    strcpy(c,s);
    strcat(c+L,s);
    return 0;
}

int main()
{
    struct timeval st;
    struct timeval et;
    int i;
    //printf("s:%x\nc:%x\n", s,c);

    //printf("s=%d c=%d\n", strlen(s), strlen(c));
    memset(s, '1', L);
    //printf("s=%d c=%d\n", strlen(s), strlen(c));
    foo();
    //printf("s=%d c=%d\n", strlen(s), strlen(c));
    //s[1024*100-1]=0;

    gettimeofday(&st,NULL);
    for( i = 0 ; i < 1000; i++ ) foo();
    gettimeofday(&et,NULL);

    printf("%f\n", time_diff(et,st));
    return 0;
}

and this is the Python one:

import time

s = '1'*102400
def foo():
    c = s + s
    #assert( len(c) == 204800 )

st = time.time()
for x in xrange(1000):
    foo()
et = time.time()

print (et-st)

and what I get:

root@xkqeacwf:~/lab/wfaster# python cp100k.py 
0.027932882309
root@xkqeacwf:~/lab/wfaster# gcc cp100k.c
root@xkqeacwf:~/lab/wfaster# ./a.out 
0.061820

Does that make sense? Or am I just making a stupid mistake?

Accumulated comments (mainly from me) converted into an answer:

  • What happens if you use your knowledge of the lengths of the strings and use memmove() or memcpy() instead of strcpy() and strcat()? (I note that the strcat() could be replaced with strcpy() with no difference in result; it might be interesting to check the timing.) Also, you didn't include <string.h> (or <stdio.h>), so you're missing any optimizations that <string.h> might provide!

Marcus: Yes, memmove() is faster than strcpy() and faster than Python, but why? Does memmove() do a word-width copy at a time?

  • Yes; on a 64-bit machine with nicely aligned data, it can move 64 bits at a time instead of 8 bits at a time; on a 32-bit machine, likely 32 bits at a time. It also has a simpler test to make on each iteration (just a count), rather than a combined 'count down and check for a null byte' test.
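The difference can be sketched in a few lines of C. This is a toy illustration only, not the real memmove(), which is hand-tuned assembler; the word-wide version assumes the length is a multiple of the word size and both buffers are word-aligned:

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-at-a-time copy: every iteration both copies one byte AND tests
 * it for the terminating null - this is what a plain strcpy() must do. */
static char *byte_copy(char *dst, const char *src)
{
    char *d = dst;
    while ((*d++ = *src++) != '\0')
        ;
    return dst;
}

/* Word-at-a-time copy over a known length: moves sizeof(uintptr_t)
 * bytes per iteration with only a countdown test, no per-byte null
 * check.  For simplicity this sketch assumes n is a multiple of the
 * word size and both buffers are word-aligned; real memcpy()/memmove()
 * implementations also handle heads, tails, misalignment and overlap. */
static void word_copy(void *dst, const void *src, size_t n)
{
    uintptr_t *d = dst;
    const uintptr_t *s = src;
    size_t i, words = n / sizeof(uintptr_t);
    for (i = 0; i < words; i++)
        d[i] = s[i];
}
```

On a 64-bit machine the second loop executes one-eighth as many iterations, each with a single loop-counter test instead of a per-byte null check.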

Marcus: But memmove() still works well even after I make L=L-13, and sizeof(s) gives L+1024-13. My machine has sizeof(int)==4.

  • The code for memmove() is highly optimized assembler, possibly inline (no function call overhead, though for 100 KiB of data, the function call overhead is minimal). The benefits are from the bigger moves and the simpler loop condition.

Marcus: So does Python use memmove() as well, or something magic?

  • I've not looked at the Python source, but it is practically a certainty that it keeps track of the length of its strings (they're null terminated, but Python always knows how long the active part of the string is). Knowing that length allows Python to use memmove() or memcpy() (the difference being that memmove() works correctly even if the source and destination overlap; memcpy() is not obliged to work correctly if they overlap). It is relatively unlikely that they've got anything faster than memmove/memcpy available.
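A length-tracked string can be concatenated without any scanning for the terminator. A minimal sketch of the idea (the LString type here is hypothetical, for illustration only, not CPython's actual object layout):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-tracked string, loosely modelled on the idea
 * that CPython stores the length alongside the bytes while still
 * null-terminating the buffer for C interoperability. */
typedef struct {
    size_t len;
    char  *data;   /* len bytes followed by '\0' */
} LString;

/* Concatenation when both lengths are already known: one allocation,
 * two memcpy() calls, one explicit terminator - no strlen() and no
 * byte-by-byte null scanning anywhere. */
static LString ls_concat(const LString *a, const LString *b)
{
    LString r;
    r.len = a->len + b->len;
    r.data = malloc(r.len + 1);
    memcpy(r.data,          a->data, a->len);
    memcpy(r.data + a->len, b->data, b->len);
    r.data[r.len] = '\0';
    return r;
}
```

This is essentially what `c = s + s` can do when the interpreter already knows `len(s)`: the whole operation reduces to memcpy() calls over known sizes.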

I modified the C code to produce more stable timings for me on my machine (Mac OS X 10.7.4, 8 GiB 1333 MHz RAM, 2.3 GHz Intel Core i7, GCC 4.7.1), and to compare strcpy() and strcat() vs memcpy() vs memmove(). Note that I increased the loop count from 1000 to 10000 to improve the stability of the timings, and I repeat the whole test (of all three mechanisms) 10 times. Arguably, the timing loop count should be increased by another factor of 5-10 so that the timings are over a second.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>

#define L (100*1024)

char s[L+1024];
char c[2*L+1024];

static double time_diff( struct timeval et, struct timeval st )
{
    return 1e-6*((et.tv_sec - st.tv_sec)*1000000 + (et.tv_usec - st.tv_usec ));
}

static int foo(void)
{
    strcpy(c,s);
    strcat(c+L,s);
    return 0;
}

static int bar(void)
{
    memcpy(c + 0, s, L);
    memcpy(c + L, s, L);
    return 0;
}

static int baz(void)
{
    memmove(c + 0, s, L);
    memmove(c + L, s, L);
    return 0;
}

static void timer(void)
{
    struct timeval st;
    struct timeval et;
    int i;

    memset(s, '1', L);
    foo();

    gettimeofday(&st,NULL);
    for( i = 0 ; i < 10000; i++ )
        foo();
    gettimeofday(&et,NULL);
    printf("foo: %f\n", time_diff(et,st));

    gettimeofday(&st,NULL);
    for( i = 0 ; i < 10000; i++ )
        bar();
    gettimeofday(&et,NULL);
    printf("bar: %f\n", time_diff(et,st));

    gettimeofday(&st,NULL);
    for( i = 0 ; i < 10000; i++ )
        baz();
    gettimeofday(&et,NULL);
    printf("baz: %f\n", time_diff(et,st));
}

int main(void)
{
    for (int i = 0; i < 10; i++)
        timer();
    return 0;
}

That gives no warnings when compiled with:

gcc -O3 -g -std=c99 -Wall -Wextra -Wmissing-prototypes -Wstrict-prototypes \
    -Wold-style-definition cp100k.c -o cp100k

The timing I got was:

foo: 1.781506
bar: 0.155201
baz: 0.144501
foo: 1.276882
bar: 0.187883
baz: 0.191538
foo: 1.090962
bar: 0.179188
baz: 0.183671
foo: 1.898331
bar: 0.142374
baz: 0.140329
foo: 1.516326
bar: 0.146018
baz: 0.144458
foo: 1.245074
bar: 0.180004
baz: 0.181697
foo: 1.635782
bar: 0.136308
baz: 0.139375
foo: 1.542530
bar: 0.138344
baz: 0.136546
foo: 1.646373
bar: 0.185739
baz: 0.194672
foo: 1.284208
bar: 0.145161
baz: 0.205196

What is weird is that if I forgo 'no warnings' and omit the <string.h> and <stdio.h> headers, as in the original posted code, the timings I get are:

foo: 1.432378
bar: 0.123245
baz: 0.120716
foo: 1.149614
bar: 0.186661
baz: 0.204024
foo: 1.529690
bar: 0.104873
baz: 0.105964
foo: 1.356727
bar: 0.150993
baz: 0.135393
foo: 0.945457
bar: 0.173606
baz: 0.170719
foo: 1.768005
bar: 0.136830
baz: 0.124262
foo: 1.457069
bar: 0.130019
baz: 0.126566
foo: 1.084092
bar: 0.173160
baz: 0.189040
foo: 1.742892
bar: 0.120824
baz: 0.124772
foo: 1.465636
bar: 0.136625
baz: 0.139923

Eyeballing those results, the header-less code seems to be faster than the 'cleaner' code, though I've not run a Student's t-test on the two sets of data, and the timings have very substantial variability (but I do have things like Boinc running 8 processes in the background). The effect seemed more pronounced in early versions of the code, when only strcpy() and strcat() were tested. I have no explanation for that, if it is a real effect!

Follow-up by mvds

Since the question was closed I cannot answer properly. On a Mac doing virtually nothing, I get these timings:

(with headers)

foo: 1.694667 bar: 0.300041 baz: 0.301693
foo: 1.696361 bar: 0.305267 baz: 0.298918
foo: 1.708898 bar: 0.299006 baz: 0.299327
foo: 1.696909 bar: 0.299919 baz: 0.300499
foo: 1.696582 bar: 0.300021 baz: 0.299775

(without headers, ignoring warnings)

foo: 1.185880 bar: 0.300287 baz: 0.300483
foo: 1.120522 bar: 0.299585 baz: 0.301144
foo: 1.122017 bar: 0.299476 baz: 0.299724
foo: 1.124904 bar: 0.301635 baz: 0.300230
foo: 1.120719 bar: 0.300118 baz: 0.299673

Preprocessor output (-E flag) shows that including the headers translates strcpy into builtin calls like:

((__builtin_object_size (c, 0) != (size_t) -1) ? __builtin___strcpy_chk (c, s, __builtin_object_size (c, 2 > 1)) : __inline_strcpy_chk (c, s));
((__builtin_object_size (c+(100*1024), 0) != (size_t) -1) ? __builtin___strcat_chk (c+(100*1024), s, __builtin_object_size (c+(100*1024), 2 > 1)) : __inline_strcat_chk (c+(100*1024), s));

So the libc version of strcpy outperforms the gcc builtin. (Using gdb, it is easily verified that a breakpoint on strcpy indeed doesn't break on the strcpy() call if the headers are included.)

On Linux (Debian 5.0.9, amd64), the differences seem to be negligible. The generated assembly (-S flag) only differs in the debugging information carried by the includes.

I believe the reason for this is that Python strings are not null-terminated.

In Python the string length is stored alongside the string, allowing it to skip the implicit strlen() used by strcat() when concatenating strings.
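That implicit scan is visible in a naive strcat() implementation (a sketch only; real libc versions are heavily optimized):

```c
/* What strcat(dst, src) must do when only null termination marks the
 * end of dst: first walk dst to find its terminator (an implicit
 * strlen(dst)), then copy src.  With a known length, the first loop
 * disappears entirely and the copy can start at dst + len directly. */
static char *naive_strcat(char *dst, const char *src)
{
    char *d = dst;
    while (*d != '\0')   /* implicit strlen(dst): O(len(dst)) scan */
        d++;
    while ((*d++ = *src++) != '\0')
        ;
    return dst;
}
```

This is also why the original program's `strcat(c+L, s)` is no worse than `strcpy(c+L, s)`: passing `c+L` skips most of the scan, leaving only the copy.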

Add to that the fact that string concatenation for Python is implemented directly in C, and that is probably the cause.

Edit: well, now that I actually look at the C code and see that it uses static buffers, I'm mystified as well, as I don't see how Python could avoid dynamic allocations, which should be much slower...
