
Why does the order of string concatenation in python affect speed greatly?

I just found this issue while debugging my code. I had a list of messages as strings that I was trying to concatenate together, and I wanted to add a newline to the end of every message.

Approach 1:

total_str = ""
for m in messages:
    total_str = total_str + m + "\n"

This was extremely slow: after around the 100,000th message, adding each message took about 2-3 seconds, and around the 300,000th message the process basically stopped.

Approach 2:

total_str = ""
for m in messages:
    tmp = m + "\n"
    total_str = total_str + tmp

This approach finished concatenating all 1.6 million messages in less than a second.

What I'm wondering is: why is the second approach so much faster than the first?

a + b + c isn't a single operation that joins a, b, and c into one string. It is two operations, t = a + b and t + c, which means the contents of a get copied twice: once when a is copied into t, and again when t is copied into the result of t + c. Since, in your example, a is the string that keeps getting longer, you are at best doubling the amount of data copied at each step.
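You can see the two operations directly in the bytecode. A minimal sketch using the standard dis module (the exact opcode names vary by CPython version):

import dis

# Disassembling the expression shows two separate additions, evaluated
# left to right because + is left-associative: first a + b, then the
# result + c. Each one allocates a brand-new string. (Older CPython
# versions show BINARY_ADD; 3.11+ shows BINARY_OP 0 (+).)
dis.dis("a + b + c")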

The best approach is to avoid all the temporary str objects created by +, and use join:

total_str = "\n".join(messages)
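One caveat: the loop in the question also puts a newline after the last message, while "\n".join only puts newlines between messages. If the trailing newline matters, a small variant (a sketch, reusing the messages list from the question) reproduces it:

# Join on the empty string and attach "\n" to each message instead,
# which matches the loop's output exactly, including the final newline.
total_str = "".join(m + "\n" for m in messages)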

join operates on each string directly, without iteratively appending them to an initially empty string one at a time. By scanning messages, join figures out how long the resulting string needs to be, allocates enough memory for it, and then sequentially copies the data from each element of messages into place.
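A rough way to see the gap for yourself; a sketch using timeit with a made-up list of 10,000 short messages (absolute timings will of course vary by machine and Python version):

import timeit

messages = ["message %d" % i for i in range(10_000)]  # hypothetical test data

def concat_loop():
    # The slow pattern from the question: the growing string is
    # recopied on every iteration, so total work is quadratic.
    total = ""
    for m in messages:
        total = total + m + "\n"
    return total

def concat_join():
    # Single-pass concatenation: each message is copied once.
    return "\n".join(messages)

print("loop:", timeit.timeit(concat_loop, number=1))
print("join:", timeit.timeit(concat_join, number=1))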

Well, since a = a + b + c is executed as a = (a + b) + c, the order of computation is the following:

  • tmp_1 = a + b. This has to copy the huge string a, because strings are immutable.
  • a = tmp_1 + c. This has to copy the (even more) huge string tmp_1, because strings are immutable.

So, there are two huge copies involved, while in the second version, a = a + tmp (as in your second example), only one such copy is needed. The latter approach is obviously faster.
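The effect is easy to put numbers on. A back-of-the-envelope sketch that just counts how many characters each version copies, under the simple model that every + copies both of its operands in full (the in-place optimization discussed further down is deliberately ignored here):

def chars_copied(num_messages, msg_len, single_temp):
    # single_temp=True models  total = total + (m + "\n")  (approach 2);
    # False models  total = total + m + "\n"  (approach 1).
    copied = 0
    total_len = 0
    for _ in range(num_messages):
        if single_temp:
            copied += msg_len + 1              # tmp = m + "\n"
            copied += total_len + msg_len + 1  # total + tmp
        else:
            copied += total_len + msg_len      # total + m
            copied += total_len + msg_len + 1  # (total + m) + "\n"
        total_len += msg_len + 1
    return copied

# With 100,000 messages of 50 characters each, approach 1 copies the
# growing string twice per step and moves about twice as much data:
print(chars_copied(100_000, 50, single_temp=False))  # ~5.1e11 characters
print(chars_copied(100_000, 50, single_temp=True))   # ~2.55e11 characters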

Python's strings are immutable and contiguous. The former means they can't be modified, and the latter means they're stored in one place in memory. This is unlike, e.g., a rope data structure, where appending data is a cheap operation that only needs to create a new node at the end. It means the concatenation operation must copy both input strings each time, and with something like total_str = total_str + m + "\n", since + is left-associative, all of total_str gets copied twice.

The usual solution is to keep all the small strings until the whole set is complete, and use str.join to perform the concatenation in one pass. That copies each component string only once, instead of a quadratic (proportional to the square) number of times. Another option, to build up a buffer as you go, is io.StringIO. That gives you a file-like object, a bit like a StringBuilder in some other languages, from which you can extract the final string. File objects also have operations like writelines that accept iterables, so the join may not be needed at all.
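A sketch of that io.StringIO option, assuming the same messages list as the question; writelines accepts any iterable of strings and writes them without adding separators, so the newlines are attached by the generator:

import io

buf = io.StringIO()                          # an in-memory text buffer
buf.writelines(m + "\n" for m in messages)   # append piecewise, no quadratic copying
total_str = buf.getvalue()                   # pull out the final string once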

My guess as to why the second implementation managed to be so much faster (not just about twice as fast) is that there are optimizations in place that sometimes permit CPython not to copy the left operand at all. PyUnicode_Append appears to have precisely such an optimization, based on unicode_modifiable: it can mutate a string object in place if its reference count is exactly 1, it has never been hashed, and a few other conditions hold. This typically applies to a local variable you use += on, and presumably the compiler managed to generate the same behaviour here because there wasn't a second operator in the same assignment.
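That fast path can be observed from pure Python. A hedged sketch: the first loop leaves the string's reference count at 1, so CPython can usually resize it in place, while the second loop keeps an extra reference around and forces a full copy on every step (the exact behaviour is an implementation detail and differs across versions and interpreters):

import timeit

def inplace_append(n):
    s = ""
    for _ in range(n):
        s += "x" * 50       # refcount of s is 1: eligible for in-place resize
    return s

def defeated_append(n):
    s = ""
    keep = []
    for _ in range(n):
        keep.append(s)      # a second reference defeats the optimization
        s = s + "x" * 50    # so this must copy all of s every iteration
    return s

print("in-place:", timeit.timeit(lambda: inplace_append(5_000), number=1))
print("copying: ", timeit.timeit(lambda: defeated_append(5_000), number=1))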
