Count number of distinct subarrays with at most m even elements

Question

You are given an array A of integers, each of which is in the range [0, 1000], along with some number m. For example, you might get this input:

A=[5,6,7,8] m=1

The question is to determine, as efficiently as possible, how many distinct, nonempty subarrays there are of the array A that contain at most m even numbers. For example, for the above array, there are eight distinct subarrays with at most one even number, as shown here:

[(5, 6, 7), (6, 7), (5, 6), (8), (5), (6), (7), (7, 8)]

Here's the solution I have so far, which runs in time O(n^3):

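def beautiful(A, m):
    # Enumerate every nonempty subarray as a tuple so it can be hashed.
    subs = [tuple(A[i:j]) for i in range(len(A)) for j in range(i + 1, len(A) + 1)]
    # Deduplicate, then keep the subarrays with at most m even elements.
    uniqSubs = set(subs)
    return len([s for s in uniqSubs if sum(x % 2 == 0 for x in s) <= m])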

Is there a better solution to this problem, ideally one that runs in linear time or at least O(n^2)?

Answer 1

I believe you can do this in linear time by using suffix trees. This is certainly not a lightweight solution - good luck coding up a linear-time algorithm for building a suffix tree with a variable-size alphabet! - but it shows that it's possible.

Here's the idea. Start by building a suffix tree for the array, treating it not as a list of numbers, but rather as a string of characters, where each character is a number. Since you know all the numbers are at most 1,000, the number of distinct characters is a constant, so using a fast construction algorithm (for example, Ukkonen's algorithm, or SA-IS to build a suffix array followed by a linear-time conversion to a suffix tree), you can build the suffix tree in time O(n).

Suffix trees are a nice structure here because they collapse repeated copies of the same substring into a single shared path, which makes it easier to deduplicate things. For example, if the pattern [1, 3, 7] appears multiple times in the array, then there will be exactly one path from the root that starts with [1, 3, 7].

The question now is how to go from the suffix tree to the number of distinct subarrays. For now, let's tackle an easier question - how do you count up the number of distinct subarrays, period, completely ignoring the restriction on odd and even numbers? This, fortunately, turns out to be a well-studied problem that can be solved in linear time. Essentially, every prefix encoded in the suffix tree corresponds to a distinct subarray of the original array, so you just need to count up how many prefixes there are. That can be done by recursively walking the tree, adding up, for each edge in the tree, how many characters are along that edge. This can be done in time O(n) because a suffix tree for an array/string of length n has O(n) nodes, and we spend a constant amount of time processing each node (just by looking at the edge above it.)
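As a rough illustration of the counting idea, here is a minimal sketch that swaps the suffix tree for a plain sorted list of suffixes. This is a different, simpler technique: the sort below is naive and not linear-time. The number of new prefixes contributed by each suffix, beyond what it shares with the previous suffix in sorted order, plays the role of the edge lengths in the tree. The function name `count_distinct_subarrays` is made up for this example.

```python
def count_distinct_subarrays(A):
    """Count distinct nonempty subarrays of the list A (no parity restriction)."""
    n = len(A)
    # Naive suffix ordering: sort start positions by the suffix they begin.
    # This costs O(n^2 log n); a linear-time construction would replace it.
    order = sorted(range(n), key=lambda i: A[i:])
    total = 0
    prev = None
    for i in order:
        lcp = 0
        if prev is not None:
            # Length of the common prefix shared with the previous sorted suffix.
            while i + lcp < n and prev + lcp < n and A[i + lcp] == A[prev + lcp]:
                lcp += 1
        # Prefixes of this suffix longer than lcp have not been counted yet.
        total += (n - i) - lcp
        prev = i
    return total
```

For the array [5, 6, 7, 8], where every element is different, this counts all 10 subarrays.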

So now we just need to incorporate the restriction on the number of even numbers you're allowed to use. This complicates things a little bit, but the reason why is subtle. Intuitively, it seems like this shouldn't be a problem. We could, after all, just do a DFS of the suffix tree and, as we go, count the number of even numbers on the path we've traversed, stopping as soon as we exceed m.

The problem with this approach is that even though the suffix tree has O(n) nodes in it, the edges, implicitly, encode ranges whose lengths can be as high as n itself. As a result, the act of scanning the edges could itself blow the runtime up to Ω(n^2): visiting Θ(n) edges and doing Ω(n) work per edge.

We can, however, speed things up a little bit. Each edge in a suffix tree is typically represented as a pair of indices [start, stop] into the original array. So let's imagine that, as an additional preprocessing step, we build a table Evens such that Evens[n] returns the number of even numbers in the array up to and including position n. Then we can count the number of even numbers in any range [start, stop] by computing Evens[stop] - Evens[start - 1] (treating Evens[-1] as 0). That takes time O(1), and it means that we can aggregate the number of even numbers we encounter along a path in time proportional to the number of edges followed, not the number of characters encountered.
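A minimal sketch of that table, assuming 0-based positions and an inclusive range [start, stop]. The count of evens before `start` is subtracted from the count through `stop`. The helper names `build_evens` and `evens_in_range` are invented for this example.

```python
def build_evens(A):
    """evens[i] = number of even values in A[0..i], inclusive."""
    evens = []
    running = 0
    for x in A:
        if x % 2 == 0:
            running += 1
        evens.append(running)
    return evens


def evens_in_range(evens, start, stop):
    """Number of even values in A[start..stop], inclusive, in O(1)."""
    before = evens[start - 1] if start > 0 else 0
    return evens[stop] - before
```

With A = [5, 6, 7, 8], the table is [0, 1, 1, 2], and the range [2, 3] (the elements 7 and 8) contains one even number.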

... except that there's one complication. What happens if we have a very long edge where, prior to reading that edge, we know that we're below the even number limit, and after reading that edge, we know that we're above the limit? That means that we need to stop partway through the edge, but we're not sure exactly where that is. That might require us to do a linear search over the edge to find the crossover point, and there goes our runtime.

But fortunately, there's a way out of that little dilemma. (This next section contains an improvement found by @Matt Timmermans.) As part of the preprocessing, in addition to the Evens array, build a second table KthEven, where KthEven[k] returns the position of the kth even number in the array. This can be built in time O(n) using the Evens array. Once you have this, let's imagine that you have a bad edge, one that will push you over the limit. If you know how many even numbers you've encountered so far, you can determine the index of the even number that will push you over the limit. Then, you can look up where that even number is by indexing into the KthEven table in time O(1). This means that we only need to spend O(1) work per edge in the suffix tree, pushing our runtime down to O(n)!
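Here is one way the crossover lookup might look, as a sketch under a few assumptions: positions are 0-based, k is 1-based, and `budget` is the number of even values the edge is still allowed to contain. For brevity both tables are built in a single pass over the array rather than deriving KthEven from Evens. The names `build_tables` and `usable_length` are hypothetical.

```python
def build_tables(A):
    """Return (evens, kth_even): evens[i] counts evens in A[0..i];
    kth_even[k] is the position of the k-th even value (k is 1-based)."""
    evens = []
    kth_even = [-1]  # slot 0 is a placeholder so k can index directly
    running = 0
    for pos, x in enumerate(A):
        if x % 2 == 0:
            running += 1
            kth_even.append(pos)
        evens.append(running)
    return evens, kth_even


def usable_length(evens, kth_even, start, stop, budget):
    """How many leading elements of A[start..stop] can be read while
    seeing at most `budget` even values. Runs in O(1)."""
    before = evens[start - 1] if start > 0 else 0
    k = before + budget + 1  # rank of the first even that is one too many
    if k >= len(kth_even) or kth_even[k] > stop:
        return stop - start + 1  # the whole edge fits within the budget
    return kth_even[k] - start   # stop just before the offending even
```

For A = [5, 6, 7, 8] and an edge covering the whole array with a budget of one even number, the lookup finds the second even number at position 3, so three elements are usable.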

So, to recap, here's a linear-time solution to this problem:

1. Build a suffix tree for the array using a fast construction algorithm, like Ukkonen's algorithm, or SA-IS followed by a linear-time conversion from the suffix array to a suffix tree. This takes time O(n) because there are at most 1,000 different numbers in the string, and 1,000 is a constant.
2. Compute the table Evens[n] in time O(n).
3. Compute the table KthEven[k] in time O(n).
4. Do a DFS over the tree, keeping track of the number of even numbers encountered so far. When encountering an edge [start, stop], compute how many even numbers are in that range using Evens in time O(1). If that keeps you within the limit, keep recursing. If not, use the KthEven table to figure out how much of the edge is usable in time O(1), and don't descend any further. Either way, increment the global count of the number of distinct subarrays by the usable length of the current edge. This does O(1) work for each of the O(n) edges in the suffix tree for a total of O(n) work.
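The recap can be sketched end to end in a few lines, with one important substitution: instead of a suffix tree and a DFS, this version walks the suffixes in sorted order and uses the common prefix with the previous suffix to avoid double counting. The sort is naive, so this sketch is not linear-time. It is only meant to show how the two tables cap each path in O(1). It relies on the fact that if a subarray has at most m even numbers, so does every prefix of it. The names `count_beautiful`, `evens_before`, and `max_valid_length` are made up for this example, and `evens_before` is a shifted variant of Evens that counts evens strictly before a position.

```python
def count_beautiful(A, m):
    """Count distinct nonempty subarrays of the list A with at most m evens."""
    n = len(A)
    kth_even = [-1]      # kth_even[k] = position of the k-th even (1-based k)
    evens_before = [0]   # evens_before[i] = number of evens in A[:i]
    for pos, x in enumerate(A):
        if x % 2 == 0:
            kth_even.append(pos)
        evens_before.append(len(kth_even) - 1)

    def max_valid_length(i):
        # Longest run starting at i that contains at most m evens, in O(1).
        k = evens_before[i] + m + 1  # rank of the first even that is too many
        end = kth_even[k] if k < len(kth_even) else n
        return end - i

    # Naive stand-in for the suffix tree: suffix start positions in sorted order.
    order = sorted(range(n), key=lambda i: A[i:])
    total = 0
    prev = None
    for i in order:
        lcp = 0
        if prev is not None:
            while i + lcp < n and prev + lcp < n and A[i + lcp] == A[prev + lcp]:
                lcp += 1
        # New subarrays from this suffix have lengths lcp+1 .. max_valid_length(i).
        total += max(0, max_valid_length(i) - lcp)
        prev = i
    return total
```

On the example from the question, `count_beautiful([5, 6, 7, 8], 1)` gives 8, matching the eight subarrays listed there.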

Phew! That wasn't an easy problem. I imagine there's some way to simplify this construction, and I'd welcome comments and suggestions about how to do this. But it shows that it is indeed possible to solve this problem in O(n) time, which is not immediately obvious!

1 answers

solution1
2 2018-08-04 23:17:00
