Why does concatenation of lists take O(n)?

According to the theory of ADTs (algebraic data types), concatenating two lists has to take O(n), where n is the length of the first list. You basically have to recurse through the first list until you find its end.

From a different point of view, one can argue that the second list can simply be linked to the last element of the first. This would take constant time if the end of the first list is known.

What am I missing here?

Operationally, a Haskell list is typically represented (roughly) by a pointer to the first cell of a singly linked list. This way, tail just returns the pointer to the next cell (it does not have to copy anything), and consing x : onto the front of the list allocates a new cell, makes it point to the old list, and returns the new pointer. The list reachable through the old pointer is unchanged, so there is no need to copy it.

If you instead append a value with ++ [x], you cannot modify the original linked list by changing its last pointer unless you know that the original list will never be accessed again. More concretely, consider

x = [1..5]
n = length (x ++ [6]) + length x

If x were modified in place while evaluating x++[6], the value of n would turn out to be 12, which is wrong. The last x refers to the unchanged list, which has length 5, so n must be 11.

In practice, you can't expect the compiler to optimize this, even in cases where x is no longer used and could, in theory, be updated in place (a "linear" use). The evaluation of x++[6] must be prepared for the worst case, in which x is reused afterwards, and so it must copy the whole list x.
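The copying described above follows from how list append is usually written. Here is a minimal sketch, using the hypothetical name append to avoid clashing with the Prelude's ++; it is meant to mirror the usual textbook definition, not to reproduce GHC's actual source.

```haskell
-- Hand-written equivalent of (++); the name `append` is an assumption
-- made for this illustration.
append :: [a] -> [a] -> [a]
append []     ys = ys                 -- end of the first list: reuse ys as it is
append (x:xs) ys = x : append xs ys   -- one new cell per element of the first list
```

The recursion takes one step per element of the first list and never inspects the second, so the cost is O(n) in the length of the first list. The original first list is left untouched because every cell of the result up to ys is newly allocated.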

As @Ben notes, saying "the list is copied" is imprecise. What actually gets copied are the cells holding the pointers (the so-called "spine" of the list), not the elements. For instance,

x = [[1,2],[2,3]]
y = x ++ [[3,4]]

only requires allocating [1,2], [2,3] and [3,4] once. The lists of lists x and y share pointers to the lists of integers, which do not have to be duplicated.

What you're asking for is related to a question I wrote for TCS Stackexchange some time back: the data structure that supports constant-time concatenation of functional lists is a difference list.

A way of handling such lists in a functional programming language was worked out by Yasuhiko Minamide in the 90s; I effectively rediscovered it a while back. However, the good run-time guarantees require language-level support that's not available in Haskell.
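For illustration, here is the function-based encoding of difference lists that is commonly used in Haskell. This is a sketch under that assumption and is not Minamide's technique; the names DList, fromList, toList, appendD and snoc are chosen for this example.

```haskell
-- A difference list represents a list as a function that prepends
-- its elements to whatever list it is given.
newtype DList a = DList ([a] -> [a])

fromList :: [a] -> DList a
fromList xs = DList (xs ++)

-- Supplying the empty list yields an ordinary list.
toList :: DList a -> [a]
toList (DList f) = f []

-- Concatenation is function composition: O(1), nothing is traversed here.
appendD :: DList a -> DList a -> DList a
appendD (DList f) (DList g) = DList (f . g)

-- Adding a single element at the end is also O(1).
snoc :: DList a -> a -> DList a
snoc (DList f) x = DList (f . (x :))
```

The work is deferred rather than eliminated: toList still pays a cost linear in the total number of elements when the list is finally built.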

It's because of immutable state. A list is an object plus a pointer, so if we imagine a list as a tuple it might look like this:

let tupleList = ("a", ("b", ("c", [])))

Now let's get the first item in this "list" with a "head" function. This head function takes O(1) time because we can use fst:

> fst tupleList

If we want to swap out the first item in the list for a different one, we could do this:

let tupleList2 = ("x",snd tupleList)

This can also be done in O(1). Why? Because no other element in the list stores a reference to the first entry. Because of immutable state, we now have two lists, tupleList and tupleList2. When we made tupleList2 we didn't copy the whole list. Because the original pointers are immutable, we can continue to reference them but use something else at the start of our list.

Now let's try to get the last element of our 3-item list:

> fst . snd . snd $ tupleList

That took 3 steps, one per element, which is the length of our list, i.e. O(n).

But couldn't we store a pointer to the last element in the list and access that in O(1)? To do that we would need an array, not a list. An array allows O(1) lookup of any element because its elements sit in contiguous memory, so the address of any element can be computed directly from its index.

(ASIDE: If you're unsure why we would use a linked list instead of an array, you should do some more reading about data structures, algorithms on data structures, and the Big-O time complexity of operations like get, poll, insert, delete, sort, etc.)

Now that we've established that, let's look at concatenation. Let's concatenate tupleList with a new list, ("e", ("f", [])). To do this we have to traverse the whole list, just like when getting the last element:

tupleList3 = (fst tupleList, (fst (snd tupleList), (fst (snd (snd tupleList)), ("e", ("f", [])))))

The above operation is actually worse than O(n) time, because for each element in the list we have to re-read the list up to that index. But let's ignore that for a moment and focus on the key aspect: in order to get to the last element in the list, we must traverse the entire structure.

You may be asking, why don't we just store in memory what the last list item is? That way appending to the end of the list would be done in O(1). But not so fast: we can't change the last list item without changing the entire list. Why?

Let's take a stab at how that might look:

data Queue a = Queue { last :: Queue a, head :: a, next :: Queue a} | Empty
appendEnd :: a -> Queue a -> Queue a
appendEnd a2 (Queue l h n) = undefined  -- ???? what could go here?

If I modify "last", which is an immutable variable, I won't actually be modifying the pointer for the last item in the queue. I will be creating a copy of the last item. Everything else that referenced the original item will continue referencing the original item.

So in order to update the last item in the queue, I have to update everything that has a reference to it, which takes O(n) time at best.

So in our traditional list, we have our final item:

List a []

But if we want to change it, we make a copy of it. Now the second-to-last item has a reference to an old version, so we need to update that item.

List a (List a [])

But if we update the second-to-last item, we make a copy of it. Now the third-to-last item has an old reference, so we need to update that too. Repeat until we get to the head of the list. And we come full circle: nothing keeps a reference to the head of the list, so editing that takes O(1).

This is the reason that Haskell doesn't have doubly linked lists. It's also why a "Queue" (or at least a FIFO queue) can't be implemented in the traditional way. Making a queue in Haskell involves some serious rethinking of traditional data structures.
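As one example of that rethinking, a FIFO queue can be built from two ordinary lists. This is a sketch of the well-known two-list (batched) queue, not something taken from this answer; the names BatchedQueue, emptyQ, enqueue and dequeue are assumptions made for the illustration.

```haskell
-- The front list holds elements in dequeue order; the back list holds
-- newly enqueued elements in reverse order.
data BatchedQueue a = BatchedQueue [a] [a]

emptyQ :: BatchedQueue a
emptyQ = BatchedQueue [] []

-- O(1): cons onto the back list, nothing else is touched.
enqueue :: a -> BatchedQueue a -> BatchedQueue a
enqueue x (BatchedQueue front back) = BatchedQueue front (x : back)

-- The back list is reversed only when the front list runs out.
dequeue :: BatchedQueue a -> Maybe (a, BatchedQueue a)
dequeue (BatchedQueue [] [])          = Nothing
dequeue (BatchedQueue [] back)        = dequeue (BatchedQueue (reverse back) [])
dequeue (BatchedQueue (x:front) back) = Just (x, BatchedQueue front back)
```

Each element is consed once and reversed once, so dequeue is O(1) amortized as long as old versions of the queue are not reused; a single dequeue can still cost O(n) when the reversal happens.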

If you become even more curious about how all of this works, consider getting the book Purely Functional Data Structures.

EDIT: If you've ever seen this: http://visualgo.net/list.html you might notice that in the visualization "Insert Tail" happens in O(1). But in order to do that we need to modify the final entry in the list to give it a new pointer. Updating a pointer mutates state, which is not allowed in a purely functional language. Hopefully that was made clear by the rest of my post.

In order to concatenate two lists (call them xs and ys), we need to modify the final node in xs to link it to (i.e. point at) the first node of ys.

But Haskell lists are immutable, so we have to create a copy of xs first. This operation is O(n) (where n is the length of xs).

Example:

xs
|
v
1 -> 2 -> 3

1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7
^              ^
|              |
xs ++ ys       ys
