Lower bound of space usage for finding median from stream of 4-bit integers. Why is it 'log n' bits?

Original post: 2020-09-25 04:39:48. Tags: algorithm, stream, space-complexity

Today, in my algorithms class, the professor said that the lower bound on space usage (in bits) for finding the median of a stream of n 4-bit integers is log n. Any idea why this is true?

1 answer

Intuitively, Θ(log n) bits are enough to record how many times you've seen each of the 16 possible 4-bit values, and from those counts you can compute the median. The intuition for why you can't (asymptotically) improve on this is that with fewer bits you can't even remember how many times you've seen each of the numbers, so you can't always return the median.
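To make the upper bound concrete, here is a minimal sketch of the counting approach. The function name median_of_4bit_stream is my own, and it assumes an odd-length stream so that the median is a single element. It keeps 16 counters, each of which never exceeds n and therefore fits in O(log n) bits.

```python
def median_of_4bit_stream(stream):
    """Return the median of an odd-length stream of 4-bit integers (0..15)."""
    counts = [0] * 16          # one counter per possible 4-bit value
    n = 0
    for x in stream:
        counts[x] += 1
        n += 1

    # For odd n, the median sits at 0-indexed position n // 2 in sorted order.
    target = n // 2
    seen = 0
    for value, c in enumerate(counts):
        seen += c
        if seen > target:      # the target position falls inside this value's run
            return value
    raise ValueError("empty stream")
```

For example, median_of_4bit_stream([3, 1, 2, 7, 5]) returns 3, without ever storing the stream itself.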

The gist of the formal argument I'm about to make is the following. Imagine I stream the first half of my input into your algorithm. If you don't have enough bits of memory, you can't uniquely remember what that input was. And if you can't remember what that input was, then I can force your algorithm to give the wrong answer by maliciously choosing the back half of the sequence of numbers.

To formalize this, let's suppose that you have an algorithm that claims to solve this problem using o(log n) (that's little-o of log n, by the way) bits of memory. Now, suppose I have a "sufficiently large" stream of n = 2k + 1 numbers, each four bits long. Since you're using o(log n) bits of memory and I've picked n to be "sufficiently large," we can say that your algorithm uses strictly fewer than, say, log (n - 1) - 1 = log (2k) - 1 = 1 + log k - 1 = log k bits of memory (all logs here are base 2).

Now, consider the following k+1 choices for a sequence of k 4-bit numbers to stream through your algorithm. The first one is k copies of 0000. The second one is k-1 copies of 0000 followed by 1 copy of 1111. The third one is k-2 copies of 0000 followed by 2 copies of 1111. More generally, there's one sequence for each of the k+1 ways to split the k elements into some number of copies of 0000 followed by some number of copies of 1111.

Now, run each of these k+1 possible options through your algorithm. You are using strictly fewer than log k bits of memory, so there are fewer than 2^(log k) = k possible states that those bits of memory can be in. And that's a problem, because I have k+1 different sequences. Therefore, by the pigeonhole principle, there must be two of those sequences that, when run through your algorithm, cause the memory of the algorithm to end up in the same state. Let's suppose that the first one has s copies of 0000, and the second has t copies of 0000, with s < t.
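The pigeonhole step can be sketched in code. Everything named here is an assumption for illustration: a streaming algorithm is modeled as a transition function step(state, x), and mod4_step is a made-up 2-bit algorithm standing in for "your algorithm". The search feeds in the k+1 first halves and returns the first two that leave the memory in the same state.

```python
def make_prefixes(k):
    """The k+1 first halves: (k - j) copies of 0000 followed by j copies of 1111."""
    return [[0b0000] * (k - j) + [0b1111] * j for j in range(k + 1)]

def run(step, state, stream):
    """Feed the stream through the transition function and return the final state."""
    for x in stream:
        state = step(state, x)
    return state

def find_collision(step, init, k):
    """Return two distinct prefixes that end in the same state, or None."""
    seen = {}
    for prefix in make_prefixes(k):
        final = run(step, init, prefix)
        if final in seen:
            return seen[final], prefix
        seen[final] = prefix
    return None

def mod4_step(state, x):
    """Hypothetical 2-bit algorithm: count copies of 1111 modulo 4."""
    return (state + (x == 0b1111)) % 4
```

With k = 8, find_collision(mod4_step, 0, 8) returns the prefix with 8 copies of 0000 and the prefix with 4 copies of 0000, since both leave the counter at 0. A collision is guaranteed whenever the number of reachable states is smaller than k+1.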

Notice that we've only fed k elements of the stream into your algorithm, so we still have k+1 remaining elements to pick. What if I choose them so that there are exactly k - s copies of 0000, with the remaining s + 1 elements being 1111? Well, in that case, look what happens.

Take the original sequence of s copies of 0000 followed by k - s copies of 1111 and run it through the algorithm. Then give it k - s copies of 0000 and s + 1 copies of 1111. The overall stream now has k copies of 0000 and k+1 copies of 1111, which means that the median is 1111.
Take the original sequence of t > s copies of 0000 followed by k - t < k - s copies of 1111 and run it through the algorithm. Then give it k - s copies of 0000 and s + 1 copies of 1111. The overall stream now has at least k+1 copies of 0000 and at most k copies of 1111, so the median should be 0000.
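The two cases above can be checked mechanically. This is a small sketch with helper names of my own (fooling_streams, true_median); it builds both full streams for given k, s, t and computes their true medians by sorting, which is fine for a check even though it is not a streaming computation.

```python
def fooling_streams(k, s, t):
    """Build the two full streams of length 2k + 1 that share the same back half."""
    assert 0 <= s < t <= k
    prefix_s = [0b0000] * s + [0b1111] * (k - s)
    prefix_t = [0b0000] * t + [0b1111] * (k - t)
    suffix = [0b0000] * (k - s) + [0b1111] * (s + 1)   # the k + 1 remaining elements
    return prefix_s + suffix, prefix_t + suffix

def true_median(stream):
    """Median of an odd-length list, computed by sorting."""
    return sorted(stream)[len(stream) // 2]
```

For k = 5, s = 1, t = 3, the first stream has 5 copies of 0000 and 6 copies of 1111 (median 1111, i.e. 15), while the second has 7 copies of 0000 and 4 copies of 1111 (median 0000).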

But here we run into a problem. The state of the algorithm is identical after seeing the first half of the input in both cases, and we've fed in the same sequence as the back half of the input, so the algorithm must produce the same output in the two cases above. But the correct outputs are different. No deterministic algorithm can return different answers from the same state on the same remaining input, so it gets at least one of the two cases wrong!
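Here is the whole contradiction played out against one concrete, made-up low-memory algorithm. The transition rule step and the output rule answer are both hypothetical; any fixed output rule would fail the same way, because the final states coincide.

```python
ZERO, ONES = 0b0000, 0b1111

def step(state, x):
    """Hypothetical 2-bit algorithm: count copies of 1111 modulo 4."""
    return (state + (x == ONES)) % 4

def answer(state):
    """Hypothetical output rule; the argument works for any fixed rule."""
    return ONES if state >= 2 else ZERO

def run(stream, state=0):
    for x in stream:
        state = step(state, x)
    return state

def true_median(stream):
    return sorted(stream)[len(stream) // 2]

# With k = 8, the prefixes with s = 4 and t = 8 copies of 0000 collide:
# they contain 4 and 0 copies of 1111, and both counts are 0 modulo 4.
k, s, t = 8, 4, 8
prefix_s = [ZERO] * s + [ONES] * (k - s)
prefix_t = [ZERO] * t + [ONES] * (k - t)
suffix = [ZERO] * (k - s) + [ONES] * (s + 1)

stream_s, stream_t = prefix_s + suffix, prefix_t + suffix
same_state = run(prefix_s) == run(prefix_t)                    # True
same_answer = answer(run(stream_s)) == answer(run(stream_t))   # True
medians = (true_median(stream_s), true_median(stream_t))       # (15, 0)
```

The algorithm gives the same answer on both streams even though their medians are 1111 and 0000, so it is wrong on at least one of them.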

This style of argument, by the way, is based on the idea of a fooling set, which is a set of inputs such that for any two of them there is at least one suffix that, when tacked on, distinguishes those two inputs. It's related to the Myhill-Nerode theorem for regular languages, and you can think of any deterministic streaming algorithm with bounded memory as a DFA with one state per combination of bits of memory.