How would I know if Python creates a new sublist in memory for: `for item in nums[1:]`

I'm not asking for an answer to the question, but rather how I, on my own, could have gotten the answer.

Original Question:

Does the following code cause Python to make a new list of size (len(nums) - 1) in memory that then gets iterated over?

for item in nums[1:]:
   # do stuff with item

Original Answer

A similar question is asked here, and a subcomment by Srinivas Reddy Thatiparthy says that a new sublist is created. But no detail is given about how he arrived at this answer, which I think makes that question very different from what I'm looking for.

Question:

How could I have figured out on my own what the answer to my question is?

I've had similar questions before. For instance, I learned that if I do my_function(nums[1:]), I don't pass in a "slice" but rather a completely new, different sublist! I found this out by just testing whether the original list passed into my_function was modified post-function (it wasn't). But I don't see an immediate way to figure out if Python is making a new sublist for the for loop example. Please help me to know how to do this.

Side note

By the way, this is the current solution I'm using from the original Stack Overflow post solutions:

for indx, item in enumerate(nums):
    if indx == 0:
        continue
    # do stuff w items

In general, the easy way to learn if you have a new chunk of data or just a new reference to an existing chunk of data is to modify the data through one reference, and then see if it is also modified through the other. (It sounds like that's "the hard way" you did, but I would recommend it as a general technique.) Some pseudocode would look like:

function areSameRef(thing1, thing2){
    thing1.modify()
    return thing1.equals(thing2) //make sure this is not just a referential equality check
}

It is very rare that this will fail, and essentially requires behind-the-scenes optimizations where data isn't cloned immediately but only when modified. In this case the fact that the underlying data is the same is being hidden from you, and in most cases, you should just trust that whoever did the hiding knows what they're doing. Exceptions are if they did it wrong, or if you have some complex performance issues. For that you may need to turn to more language-specific debugging or profiling tools. (See below for more)
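The modify-through-one-reference check described above can be sketched in Python. This is only an illustration, not a general-purpose tool: the helper name `shares_data` is hypothetical, and it assumes both objects support item assignment at the same index.

```python
def shares_data(thing1, thing2, index=0):
    """Return True if a write through thing1 is visible through thing2.

    Hypothetical helper: assumes both objects are indexable and mutable.
    The original value is restored before returning.
    """
    sentinel = object()          # a value that cannot already be present
    original = thing1[index]
    thing1[index] = sentinel
    try:
        return thing2[index] is sentinel
    finally:
        thing1[index] = original

nums = [1, 2, 3]
alias = nums        # second name for the same list
sliced = nums[:]    # full slice of a list

print(shares_data(nums, alias))   # True
print(shares_data(nums, sliced))  # False
```

Using a fresh `object()` as the sentinel avoids a false positive when the two containers merely happen to hold equal values.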

Do also be careful about cases where part of the data may be shared - for instance, look up cons lists and tail sharing. In those cases if you do something like:

function foo(list1, list2){
   list1.append(someElement)
   return list1.length == list2.length
}

will return false - the element is only added to the first list, but something like

function bar(list1, list2){
    list1.set(someIndex, someElement)
    return list1.get(someIndex)==list2.get(someIndex)
}

will return true (though in practice, lists created that way usually don't have an interface that allows mutability.)

I don't see a question in part 2, but yes, your conclusion looks valid to me.
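To make the tail-sharing case concrete, here is a minimal sketch of a cons list in Python. The `Cons` class and `length` helper are made up for illustration (Python's built-in list is not a cons list), and the cells are deliberately left mutable so that the shared write can be shown.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cons:
    """One cell of a hypothetical singly linked (cons) list."""
    head: int
    tail: Optional["Cons"] = None

def length(cell):
    """Count the cells reachable from this one."""
    n = 0
    while cell is not None:
        n += 1
        cell = cell.tail
    return n

shared = Cons(2, Cons(3))   # common tail: 2 -> 3
list1 = Cons(1, shared)     # 1 -> 2 -> 3
list2 = shared              # 2 -> 3, the very same cells as list1's tail

list1 = Cons(0, list1)      # adding an element only grows list1
print(length(list1), length(list2))   # 4 2

list1.tail.tail.head = 99   # write to a cell inside the shared tail
print(list2.head)           # 99
```

So a length comparison says "different lists", while a write inside the shared part is visible through both names.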

EDIT: More on actual memory usage

As you pointed out, there are situations where that sort of test won't work because you don't actually have two references, as in the for i in nums[1:] case. In that case I would say turn to a profiler, but you couldn't really trust the results.

The reason for that comes down to how compilers/interpreters work, and the contract they fulfill in the language specification. The general rule is that the interpreter is allowed to re-arrange and modify the execution of your code in any way that does not change the results, but may change the memory or time performance. So, if the state of your code and all the I/O are the same, it should not be possible for foo(5) to return 6 in one interpreter implementation/execution and 7 in another, but it is valid for them to take very different amounts of time and memory.

This matters because a lot of what interpreters and compilers do is behind-the-scenes optimizations; they will try to make your code run as fast as possible and with as small a memory footprint as possible, so long as the results are the same. However, they can only do so when they can prove that the changes will not modify the results.

This means that if you write a simple test case, the interpreter may optimize it behind the scenes to minimize the memory usage and give you one result - "no new list is created." But, if you try to trust that result in real code, the real code may be too complex for the compiler to tell if the optimization is safe, and it may fail. It can also depend upon the specific interpreter version, environment variables or available hardware resources.

Here's an example:

def foo(x: int):
    l = list(range(9999))  # built but never read
    return 5

def bar(x: int):
    l = list(range(9999))
    if x + 1 != (x * 2 + 2) / 2:  # algebraically never true
        return l[x]
    else:
        return 5

I can't promise this for any particular language, but I would usually expect foo and bar to have much different memory usages. In foo, any moderately-well-created interpreter should be able to tell that l is never referenced before it goes out of scope, and thus can freely skip actually allocating any memory at all as a safe operation. In bar (unless I failed at arithmetic), l will never be used either - but knowing that requires some reasoning about the condition of the if statement. It takes a much smarter interpreter to recognize that, so even though these two code snippets might look the same logically, they can have very different behind-the-scenes performances.

EDIT: As has been pointed out to me, Python specifically may not be able to optimize either of these, given the dynamic nature of the language; the range function and the list type may both have been re-assigned or altered from elsewhere in the code. Without specific expertise in the Python optimization world I can't say what they do or don't do. Anyway I'm leaving this here for edification on the general concept of optimizations, but take my error as a case lesson in "reasoning about optimization is hard".

All of that being said: FWIW, I strongly suspect that the Python interpreter is smart enough to recognize that for i in nums[1:] should not actually allocate new memory, but just iterate over a slice. That looks to my eyes to be a relatively simple, safe and valuable transformation on a very common use case, so I would expect the (highly optimized) Python interpreter to handle it.

EDIT2: As a final (opinionated) note, I'm less confident about that in Python than I am in almost any other language, because Python syntax is so flexible and allows so many strange things. This makes it much more difficult for the Python interpreter (or a human, for that matter) to say anything with confidence, because the space of "legal Python code" is so large. This is a big part of why I prefer much stricter languages like Rust, which force the programmer to color inside the lines but result in much more predictable behaviors.

EDIT3: As a post-final note, usually for things like this it's best to trust that the execution environment is handling these sorts of low-level optimizations. Nine times out of ten, don't try to solve this kind of performance problem until something actually breaks.
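For the for-loop case, where there is no second reference to test, one profiler-style measurement can be sketched with the standard-library tracemalloc module. This is a rough sketch rather than a definitive benchmark: the helper name `peak_bytes` and the list size are arbitrary choices, the numbers depend on the interpreter and version, and `itertools.islice` is included only as a comparison loop that skips the first item without slicing.

```python
import tracemalloc
from itertools import islice

def peak_bytes(loop):
    """Run loop() and return the peak memory traced while it ran."""
    tracemalloc.start()
    try:
        loop()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

nums = list(range(100_000))  # created before tracing starts

def with_slice():
    for item in nums[1:]:
        pass

def with_islice():
    for item in islice(nums, 1, None):
        pass

slice_peak = peak_bytes(with_slice)
islice_peak = peak_bytes(with_islice)
print("nums[1:] peak bytes:", slice_peak)
print("islice   peak bytes:", islice_peak)
```

If the slice loop reports a peak roughly proportional to len(nums) while the islice loop stays small, that is evidence that the interpreter you ran it on did allocate a new list for the slice.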

As for knowing how list slicing works, from the Python documentation section Sequence Types — list, tuple, range, we know that

s[i:j] - The slice of s from i to j is defined as the sequence of items with index k such that i <= k < j.

So, the slice creates a new sequence but we don't know whether that sequence is a list or whether there is some clever way that the same list object somehow represents both of these sequences. That's not too surprising with the Python language spec where lists are described as part of the general discussion of sequences and the spec never really tries to cover all of the details for object implementation.

That's because in the end, something like nums[1:] is really just syntactic sugar for nums.__getitem__(slice(1, None)), meaning that lists get to decide for themselves what slicing means. And you need to go to the source for the implementation. See the list_subscript function in listobject.c.
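The point that slicing is routed through __getitem__ can be observed directly. A minimal sketch, using a made-up wrapper class `Spy` that records whatever key it is handed:

```python
class Spy:
    """Hypothetical wrapper that records the key passed to __getitem__."""

    def __init__(self, data):
        self.data = data
        self.seen = []

    def __getitem__(self, key):
        self.seen.append(key)
        return self.data[key]   # delegate to the wrapped list

spy = Spy([1, 2, 3])
result = spy[1:]

print(spy.seen)   # [slice(1, None, None)]
print(result)     # [2, 3]
```

Because the object itself receives the slice, a different class is free to return a copy, a view, or anything else, which is why the answer has to come from the implementation of each type.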

But we can experiment. Looking at the documentation for The for statement,

for_stmt ::= "for" target_list "in" starred_list ":" suite ["else" ":" suite]

The starred_list expression is evaluated once; it should yield an iterable object.

So, nums[1:] is an expression that must yield an iterable object and we can assign that object to an intermediate variable.

nums = [1, 2, 3]
tmp = nums[1:]
for item in tmp:
    pass

tmp[0] = 999  # write through the slice, then check the original

assert id(nums) != id(tmp), "List slice creates a new object"
assert type(tmp) == type(nums), "List slice creates a new list"
assert 999 not in nums, "List slice doesn't affect original"

Run that, and if no assertion error is raised, you know that a new list was created.

Other sequence-like objects may work radically differently. In a numpy array, for instance, two array objects may indeed reference the same memory. In this example, the final assertion fails because the slice is another view into the same array. Yes, this can keep you up all night.

import numpy as np

nums = np.array([1,2,3])
tmp = nums[1:]
for item in tmp:
    pass

tmp[0] = 999

assert id(nums) != id(tmp), "array slice creates a new object"
assert type(tmp) == type(nums), "array slice creates a new array"
assert 999 not in nums, "array slice doesn't affect original"

You can use the walrus operator := (available since Python 3.8) to capture the temporary object created by Python for the slice. A little investigation demonstrates that they aren't the same object.

import sys
print(sys.version)

a = list(range(1000))
for i in (b := a[1:]):
    b[0] = 906
print(b is a)
print(a[:10])
print(b[:10])
print(sys.getsizeof(a))
print(sys.getsizeof(b))

Generates the following output:

3.11.0 (main, Nov  4 2022, 00:14:47) [GCC 7.5.0]
False
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[906, 2, 3, 4, 5, 6, 7, 8, 9, 10]
8056
8048

See for yourself on the Godbolt Compiler Explorer where you can also see the compiler generated code.
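The generated bytecode can also be inspected locally with the standard-library dis module. A small sketch, with a made-up function name `loop`; the exact opcode names differ between CPython versions, so the printed list is not shown here:

```python
import dis

def loop(nums):
    for item in nums[1:]:
        pass

# Collect the opcode names in order; dis.dis(loop) prints a fuller listing.
opnames = [ins.opname for ins in dis.get_instructions(loop)]
print(opnames)
```

In the listing, the instructions that build and apply the slice come before GET_ITER, which suggests the slice expression is evaluated to an object first and the loop then iterates over that object. The bytecode alone does not say what list.__getitem__ returns, so it complements the experiments above rather than replacing them.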
