Find longest substring in binary string with not less ones than zeros

How to find, in a binary string, the longest substring where the balance, i.e. the difference between the number of ones and zeros, is >= 0?

Example:

01110000010 -> 6: 011100

1110000011110000111 -> 19: entire string

While this problem looks very similar to the Maximum Value Contiguous Subsequence (Maximum Contiguous Sum) problem, a dynamic programming solution doesn't seem to be obvious. In a divide-and-conquer approach, how to do the merging? Is an "efficient" algorithm possible after all? (A trivial O(n^2) algorithm will just iterate over all substrings for all possible starting points.)
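For reference, the trivial quadratic baseline mentioned above could look roughly like the following. This is only a sketch; the function name `longest_brute_force` is made up for illustration and is not from the original question.

```python
def longest_brute_force(s):
    """Length of the longest substring of s with at least as many 1s as 0s (O(n^2))."""
    best = 0
    for start in range(len(s)):
        balance = 0  # (#1s - #0s) in s[start:end + 1]
        for end in range(start, len(s)):
            balance += 1 if s[end] == "1" else -1
            if balance >= 0:
                best = max(best, end - start + 1)
    return best


print(longest_brute_force("01110000010"))          # 6
print(longest_brute_force("1110000011110000111"))  # 19
```

Such a baseline is mostly useful for cross-checking the faster approaches on small inputs.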

This is a modified variant of Finding a substring, with some additional conditions. The difference is that in the linked question, only such substrings are allowed where balance never falls below zero (looking at the string in either forward or backward direction). In the given problem, balance is allowed to fall below zero, provided it recovers at some later stage.

I have a solution that requires O(n) additional memory and O(n) time.

Let's denote the 'height' of an index h(i) as

h(i) = <number of 1s in the substring 1..i> - <number of 0s in the same substring>

The problem can now be reformulated as: find i and j such that h(i) <= h(j) and j - i is maximal.

Obviously, h(0) = 0, and if h(n) >= 0, then the solution is the entire string.

Now let's compute the array B so that B[x] = min{i: h(i) = -x}. In other words, let B[x] be the leftmost index i at which h(i) = -x.

The array B[x] has at most n + 1 entries, and is computed in one linear pass.

Now we can iterate over the original string and, for each index i, compute the length of the longest sequence with non-negative balance that ends at i as follows:

Lmax(i) = i - B[MAX{0, -h(i)}]

(For h(i) >= 0 this uses B[0] = 0; for h(i) < 0 it uses the leftmost index with the same height.)

The largest Lmax(i) across all i will give you the desired length.

I leave the proof as an exercise :) Contact me if you can't figure it out.

Also, my algorithm needs 2 passes over the original string, but you can collapse them into one.
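A minimal one-pass sketch of this idea follows. The names `longest_via_first_occurrence` and `first_at` are assumptions made for illustration; `first_at` plays the role of the array B and is grown lazily while scanning.

```python
def longest_via_first_occurrence(s):
    """O(n) time and memory: combine the height h(i) with the array B described above."""
    first_at = [0]  # first_at[x] = leftmost prefix index i with h(i) == -x; h(0) = 0
    h = 0
    best = 0
    for i, ch in enumerate(s, start=1):
        h += 1 if ch == "1" else -1
        if -h == len(first_at):
            # A new minimum height is reached for the first time at prefix index i.
            first_at.append(i)
        # Leftmost prefix index whose height is <= h(i).
        best = max(best, i - first_at[max(0, -h)])
    return best


print(longest_via_first_occurrence("01110000010"))          # 6
print(longest_via_first_occurrence("1110000011110000111"))  # 19
```

Because heights change by exactly one per character, a new minimum is always exactly one below the previous minimum, which is why appending to `first_at` is sufficient.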

This can be answered quite easily in O(n) using a "height array", representing the number of 1's relative to the number of 0's. Like my answer in the linked question.

Now, instead of focusing on the original array, we focus on two arrays indexed by the heights: one will contain the smallest index at which each height is found, and the other will contain the largest index at which each height is found. Since we don't want a negative index, we can shift everything up, such that the minimum height is 0.

So for the sample cases (I added two more 1's at the end to show my point):

1110000011010000011111
Array height visualization
  /\
 /  \
/    \
      \  /\/\        /
       \/    \      /
              \    /
               \  /
                \/
(lowest height = -5)
Shifted height array:
[5, 6, 7, 8, 7, 6, 5, 4, 3, 4, 5, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5]
     Height:   0  1  2  3  4  5  6  7  8
first_view = [17,16,15, 8, 7, 0, 1, 2, 3]
last_view  = [17,18,19,20,21,22, 5, 4, 3]

Note that we have 22 numbers and 23 distinct indices, 0-22, representing the 23 positions between and around the numbers.

We can build the first_view and last_view arrays in O(n).

Now, for each height in first_view, we only need to check every equal or larger height in last_view, and take the index with maximum difference from the first_view index. For example, from height 0, the maximum index among equal or larger heights is 22. So the longest substring starting at index 17+1 will end at index 22.

To find the maximum index on the last_view array, you can convert it to a running maximum from the right in O(n):

last_view_max = [22,22,22,22,22,22, 5, 4, 3]

And so finding the answer is simply subtracting first_view from last_view_max,

first_view    = [17,16,15, 8, 7, 0, 1, 2, 3]
last_view_max = [22,22,22,22,22,22, 5, 4, 3]
result        = [ 5, 6, 7,14,15,22, 4, 2, 0]

and taking the maximum (again in O(n)), which is 22, achieved from starting index 0 to ending index 22, i.e., the whole string. =D
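The steps above can be sketched as follows. The array names `first_view`, `last_view` and `last_view_max` are taken from the answer; the function name `longest_by_views` is an assumption made for illustration.

```python
def longest_by_views(s):
    """O(n): first/last index per shifted height, then a running maximum from the right."""
    heights = [0]
    for ch in s:
        heights.append(heights[-1] + (1 if ch == "1" else -1))
    lowest = min(heights)
    shifted = [h - lowest for h in heights]  # minimum height becomes 0

    size = max(shifted) + 1
    first_view = [None] * size
    last_view = [None] * size
    for idx, h in enumerate(shifted):
        if first_view[h] is None:
            first_view[h] = idx
        last_view[h] = idx
    # Heights move in steps of 1, so every slot between 0 and the maximum is filled.

    last_view_max = last_view[:]
    for h in range(size - 2, -1, -1):
        last_view_max[h] = max(last_view_max[h], last_view_max[h + 1])

    return max(last - first for first, last in zip(first_view, last_view_max))


print(longest_by_views("1110000011010000011111"))  # 22
```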

Proof of correctness:

Suppose that the maximum substring starts at index i and ends at index j. If the height at index i is the same as the height at index k<i, then k..j would be a longer substring still satisfying the requirement. Therefore it suffices to consider the first index of each height. Analogously for the last index.

Compressed quadratic runtime

We will be looking for (locally) longest substrings with balance zero, starting at the beginning. We will ignore strings of zeros. (Corner cases: All zeros -> empty string, balance never reaches zero again -> entire string.) Of these substrings with balance zero, all trailing zeros will be removed.

Denote by B a substring with balance > 0 and by Z a substring with only zeros. Each input string can be decomposed as follows (pseudo-regex notation):

B? (ZB)* Z?

Each of the Bs is a maximum feasible solution, meaning that it cannot be extended in either direction without reducing balance. However, it might be possible to collapse sequences of BZB or ZBZ if the balance is still larger than zero after collapsing.

Note that it is always possible to collapse sequences of BZBZB to a single B if the ZBZ part has balance >= 0. (Can be done in one pass in linear time.) Once all such sequences have been collapsed, the balance of each ZBZ part is below zero. Still, it is possible that there exist BZB parts with balance above zero -- it is even possible that, in a BZBZB sequence with balance below zero, both the leading and trailing BZB parts have balance above zero. At this point, it seems to be difficult to decide which BZB to collapse.

Still quadratic...

Anyway, with this simplified data structure one can try all Bs as starting points (possibly extending to the left if there's still balance left). Run time is still quadratic, but (in practice) with a much smaller n.

Divide and conquer

Another classic. Should run in O(n log n), but rather difficult to implement.

Idea

The longest feasible substring is either in the left half, in the right half, or it passes over the boundary. Call the algorithm for both halves. For the boundary:

Assume problem size n. For the longest feasible substring that crosses the boundary, we are going to compute the balance of the left-half part of the substring.

Determine, for each possible balance between -n/2 and n/2, in the left half, the length of the longest string that ends at the boundary and has this (or a larger) balance. (Linear time!) Do the same for the right half and the longest string that starts at the boundary. The result is two arrays of size n + 1; we reverse one of them, add them element-wise and find the maximum. (Again, linear.)
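One possible way to write this down is sketched below. The names `longest_dc` and `_best_len_per_balance` are hypothetical, and float("-inf") is used here as a marker for "no string with this balance", which is an implementation choice rather than part of the description above. Instead of physically reversing one array, the merge step pairs balance b on the left with balance -b on the right.

```python
NEG = float("-inf")  # marker: no string reaches this balance


def _best_len_per_balance(chars, m):
    """best[b + m] = length of the longest prefix of chars with balance >= b, for -m <= b <= m."""
    best = [NEG] * (2 * m + 1)
    best[m] = 0  # the empty prefix has balance 0
    balance = 0
    for length, ch in enumerate(chars, start=1):
        balance += 1 if ch == "1" else -1
        best[balance + m] = length  # later (longer) prefixes overwrite shorter ones
    for k in range(2 * m - 1, -1, -1):
        best[k] = max(best[k], best[k + 1])  # turn "exactly b" into "b or larger"
    return best


def longest_dc(s, lo=0, hi=None):
    """Longest substring of s[lo:hi] with balance >= 0, by divide and conquer."""
    if hi is None:
        hi = len(s)
    if hi - lo == 0:
        return 0
    if hi - lo == 1:
        return 1 if s[lo] == "1" else 0
    mid = (lo + hi) // 2
    m = hi - mid  # the right half is at least as long as the left half
    left = _best_len_per_balance(reversed(s[lo:mid]), m)  # strings ending at the boundary
    right = _best_len_per_balance(s[mid:hi], m)           # strings starting at the boundary
    crossing = max(left[b + m] + right[-b + m] for b in range(-m, m + 1))
    return max(longest_dc(s, lo, mid), longest_dc(s, mid, hi), crossing)


print(longest_dc("01110000010"))  # 6
```

Each merge touches every character of the current range a constant number of times, which is where the O(n log n) total comes from.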

Why does it work?

A substring with balance >= 0 that crosses the boundary can have balance < 0 in either the left or the right part, if the other part compensates this. ("Borrowing" balance.) The crucial question is how much to borrow; we iterate over all potential "balance credits" and find the best trade-off.

Why is this O(n log n)?

Because merging (looking at boundary-crossing strings) takes only linear time.

Why is merging O(n)?

Exercise left to the reader.

Dynamic programming -- linear run time (finally!)

Inspired by this blog post. Simple and efficient, one-pass online algorithm, but takes some time to explain.

Idea

The link above shows a different problem: Maximum subsequence sum. It cannot be mapped 1:1 to the given problem; here a "state" of O(n) is needed, in contrast to O(1) for the original problem. Still, the state can be updated in O(1).

Let's rephrase the problem. We are looking for the longest substring in the input where the balance, i.e. the difference between the number of 1's and 0's, is greater than or equal to zero.

The state is similar to my other divide-and-conquer solution: We compute, for each position i and for each possible balance b, the starting position s(i, b) of the longest string with balance b or greater that ends at position i. That is, the string that starts at index s(i, b) + 1 and ends at i has balance b or greater, and there is no longer such string that ends at i. We find the result by maximizing i - s(i, 0).

Algorithm

Of course, we do not keep all s(i, b) in memory, just those for the current i (as we iterate over the input). We start with s(0, b) := 0 for b <= 0 and := undefined for b > 0. For each i, we update with the following rule:

  1. If 1 is read: s(i, b) := s(i - 1, b - 1).
  2. If 0 is read: s(i, b) := s(i - 1, b + 1) if defined, s(i, 0) := i if s(i - 1, 1) undefined.

The function s (for the current i) can be implemented as a pointer into an array of length 2n + 1; this pointer is moved forward or backward depending on the input. At each iteration, we note the value of s(i, 0).
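The pointer-into-an-array formulation might be sketched like this. The names `longest_dp`, `start_of` and `p` are made up for illustration; the layout (undefined entries stored as None, `start_of[p - b]` standing for s(i, b)) is one possible reading of the description, not necessarily the author's exact implementation.

```python
def longest_dp(s):
    """One pass; returns (length, start) of a longest substring with balance >= 0."""
    n = len(s)
    # start_of[p - b] represents s(i, b) for the current i.
    # Initially s(0, b) = 0 for b <= 0 (slots n..2n) and undefined for b > 0 (slots 0..n-1).
    start_of = [None] * n + [0] * (n + 1)
    p = n  # slot of s(i, 0)
    best_len, best_start = 0, 0
    for i, ch in enumerate(s, start=1):
        if ch == "1":
            p += 1  # rule 1: s(i, b) = s(i - 1, b - 1)
        else:
            p -= 1  # rule 2: s(i, b) = s(i - 1, b + 1) ...
            if start_of[p] is None:
                start_of[p] = i  # ... and s(i, 0) = i if s(i - 1, 1) was undefined
        length = i - start_of[p]
        if length > best_len:
            best_len, best_start = length, start_of[p]
    return best_len, best_start


length, start = longest_dp("01110000010")
print(length, "01110000010"[start:start + length])  # 6 011100
```

Moving the pointer shifts all balances at once, so each character costs O(1) even though the state has O(n) entries.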

How does it work?

The state function s becomes effective especially if the balance from the start to i is negative. It records the earliest start point where zero balance is reached, for all possible numbers of 1s that have not been read yet.

Why does it work?

Because the recursive definition of the state function is equivalent to its direct definition -- the starting position of the longest string with balance b or greater that ends at position i.

Why is the recursive definition correct?

Proof by induction.
