
Count character occurrences in a substring in O(1) time with preprocessing

This question has not been asked before, since I am specifically asking for O(1) constant time once the time taken to preprocess the string is amortized over many freq operations.

An interviewer asked me to find the total count of a character appearing in a substring. Given a string, the character to find, and the start and end indices of the range to search, I am supposed to find the most optimized way to do it, e.g.:

String s = "abcnc";
char find = 'c';
int start = 1;
int end = 4;

That should return a result of 2, since 'c' appears twice in the specified substring 'bcnc'.

What I did was straightforward, which was

int freq(String s, char c, int start, int end) {
    int result = 0; 
    for (int i = start; i <= end; i++) { // end is inclusive, as in the example above
        if(s.charAt(i) == c) {
            result++;
        }
    }
    
    return result;
}

Which has an O(N) time complexity.

However, the interviewer said it can be optimized further by first preprocessing the string, so that the freq() method has O(1) complexity. I was stumped because I don't see how it can be made better than O(N). The interviewer told me I should use a map or a list, or both, to first record the indices at which each character appears, and that this would give me a more optimized solution.

You can do it in O(1) (constant time) using O(N) space and O(N) pre-processing time, without maps or anything other than basic Java, as follows (the code assumes the string contains only the lowercase letters a-z):

Make a single pass over the string and keep a cumulative count of each letter seen so far:

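import java.util.Arrays;

// counts[i][k] = occurrences of the letter ('a' + k) in input[0..i], inclusive
int[][] counts;

public void process(String input) {
    counts = new int[input.length()][];
    for (int i = 0; i < input.length(); i++) {
        // start from the previous position's totals, then add the current letter
        counts[i] = i == 0 ? new int[26] : Arrays.copyOf(counts[i - 1], 26);
        counts[i][input.charAt(i) - 'a']++;
    }
}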

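Then return the difference between the cumulative count at end and the one just before start (both indices inclusive):

public int count(char c, int start, int end) {
    // counts[i] includes position i, so subtract the total just before start
    int before = start == 0 ? 0 : counts[start - 1][c - 'a'];
    return counts[end][c - 'a'] - before;
}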
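
The special case for start == 0 can be avoided by storing one extra row of zeros in front. The following is only a sketch of that variation: the class name PrefixCounts is hypothetical, and it assumes lowercase a-z input and an inclusive end index, as in the question's example.

```java
import java.util.Arrays;

// Hypothetical helper; assumes the input contains only 'a'..'z'.
final class PrefixCounts {
    // prefix[i][k] = occurrences of the letter ('a' + k) in the first i characters
    private final int[][] prefix;

    PrefixCounts(String input) {
        prefix = new int[input.length() + 1][];
        prefix[0] = new int[26]; // nothing seen yet
        for (int i = 0; i < input.length(); i++) {
            prefix[i + 1] = Arrays.copyOf(prefix[i], 26);
            prefix[i + 1][input.charAt(i) - 'a']++;
        }
    }

    // Occurrences of c in input[start..end], both ends inclusive (assumed convention).
    int count(char c, int start, int end) {
        return prefix[end + 1][c - 'a'] - prefix[start][c - 'a'];
    }
}
```

For "abcnc", new PrefixCounts("abcnc").count('c', 1, 4) gives 2. The table holds 26 ints per position, which is still O(N) space for a fixed alphabet.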

First, convert your input string into a map that maps characters onto a sorted list of indices. For example:

String input = "abcnc";
var store = new HashMap<Character, TreeSet<Integer>>();
for (int i = 0; i < input.length(); i++) {
  store.computeIfAbsent(input.charAt(i), k -> new TreeSet<Integer>()).add(i);
}

System.out.println(store);

This makes: {a=[0], b=[1], c=[2, 4], n=[3]}

The cost of building this can be written off as 'preprocessing', but if it matters, it is O(n log n).

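Armed with store, you can do the job in O(log n) time. For example, if I want to know how many c's are in the 3-5 range, I ask for the TreeSet matching c (getting me the [2, 4] TreeSet). Then I can use TreeSet's headSet and tailSet methods to narrow it down, each of which is an O(log n) lookup. One caveat: in the JDK, calling size() on such a view iterates over its elements, so to keep the count itself at O(log n) you would binary-search a sorted array or list of indices instead.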

This gives me a total runtime of O(log n), which is as near to O(1) as makes it irrelevant (in the sense that practical concerns about how modern computer architecture works will dwarf the difference). If an interviewer does not accept that answer, they are either needlessly pedantic or wildly misguided about how modern computers work, so now we delve into a purely academic exercise to knock it down to O(1).
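
As a rough sketch of the O(log n) idea, the version below swaps the TreeSet for a sorted ArrayList of indices per character and uses Collections.binarySearch to count the indices inside a range. The class name IndexLists and the inclusive end index are assumptions made for illustration, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: character -> ascending list of positions.
final class IndexLists {
    private final Map<Character, List<Integer>> positions = new HashMap<>();

    IndexLists(String input) {
        // Indices are appended in increasing order, so each list is already sorted.
        for (int i = 0; i < input.length(); i++) {
            positions.computeIfAbsent(input.charAt(i), k -> new ArrayList<>()).add(i);
        }
    }

    // Number of stored indices strictly smaller than bound (indices are unique).
    private static int lowerBound(List<Integer> sorted, int bound) {
        int pos = Collections.binarySearch(sorted, bound);
        return pos >= 0 ? pos : -(pos + 1);
    }

    // Occurrences of c in input[start..end], both ends inclusive (assumed convention).
    int count(char c, int start, int end) {
        List<Integer> idx = positions.get(c);
        if (idx == null) return 0; // c never occurs
        return lowerBound(idx, end + 1) - lowerBound(idx, start);
    }
}
```

Each query costs two binary searches, and the structure needs only one integer per character of the input.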

O(1)

Instead of mapping each character to a TreeSet, we map it to an int[]. This array is as long as the entire input (so in this case, the 4 int[] arrays for keys 'a', 'b', 'c', and 'n' all have length 5, because the input has length 5). Each element answers the question: if you asked for the count from THIS position to the end of the string, what is the correct answer? So, for c it would be: [2, 2, 2, 1, 1]. Note that a trailing 0 is left out because we don't need it (the number of Xs from the end of the string to the end of the string is, of course, always 0, regardless of which character we are talking about). Had the input been abcnca, the arrays would have length 6 and the one for c would contain [2, 2, 2, 1, 1, 0].

Armed with this array, the answer can be provided in O(1) time: It's 'the answer I would give if you asked me to go from start index to the end-of-string', minus 'the answer I would give if you asked me to go from end index to the end-of-string'. Taking into account, of course, that if the question's end index matches end-of-string, just look up the answer in the int array (don't subtract anything).
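
The suffix-count idea can be sketched as follows. The class name SuffixCounts is hypothetical, and the end index is treated as exclusive here, which is one reading of the description above (an end index equal to the string length means "up to the end of the string").

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: character -> counts from each position to the end of the string.
final class SuffixCounts {
    private final Map<Character, int[]> suffix = new HashMap<>();
    private final int length;

    SuffixCounts(String input) {
        final int n = input.length();
        length = n;
        // Mark every occurrence with a 1 in that character's array.
        for (int i = 0; i < n; i++) {
            suffix.computeIfAbsent(input.charAt(i), k -> new int[n])[i] = 1;
        }
        // Turn the marks into running totals, accumulating from right to left.
        for (int[] arr : suffix.values()) {
            for (int i = n - 2; i >= 0; i--) arr[i] += arr[i + 1];
        }
    }

    // Count from position i to the end; the implicit entry at i == length is 0.
    private int fromPosition(int[] arr, int i) {
        return i == length ? 0 : arr[i];
    }

    // Occurrences of c in input[start..endExclusive), end exclusive (assumed convention).
    int count(char c, int start, int endExclusive) {
        int[] arr = suffix.get(c);
        if (arr == null) return 0; // c never occurs
        return fromPosition(arr, start) - fromPosition(arr, endExclusive);
    }
}
```

For "abcnc" the array built for 'c' is [2, 2, 2, 1, 1], matching the description, and count('c', 1, 5) is 2.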

This means the time taken after preprocessing is O(1), but the size of the preprocessed data structure is now considerable. For example, if the input string is 1 MB large and contains 5000 different characters, it's a 20 GB table: 5000 entries in the map, each mapping to an array with a million entries at 4 bytes a pop, is 5000 * 1000000 * 4 = 20 GB.
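
The same estimate as a small calculation. The helper name TableSize is hypothetical, and the figure ignores array headers and HashMap overhead, so it is a lower bound rather than an exact footprint.

```java
// Hypothetical helper for a back-of-the-envelope size estimate.
final class TableSize {
    // One int per (distinct character, position) pair; long avoids int overflow.
    static long bytes(long distinctChars, long stringLength) {
        return distinctChars * stringLength * Integer.BYTES;
    }
}
```

TableSize.bytes(5000, 1_000_000) evaluates to 20,000,000,000 bytes, i.e. the 20 GB quoted above.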

This question has not been asked before, since I am specifically asking for O(1) constant time once the time taken to preprocess the string is amortized over many freq operations.

To achieve an amortized constant time for this method, you can generate a HashMap , associating each character with an array of length equal to the length of the given String, where every array element represents the number of occurrences of a particular character from the beginning of the String up to a particular index.

The first call of the freq() would populate the Map and would run in O(n) , subsequent calls would execute in constant time O(1) .

Therefore the so-called amortized time complexity, which considers the upper bound on the total cost of N invocations of the operation, would be O(1) (provided the number of calls N is at least proportional to the string length n).

The algorithm for populating the Map resembles the Counting sort algorithm (we need to perform the steps represented by the first two for-loops in the usual Counting sort pseudocode).

In order to populate the Map, every unique character encountered in the string needs to be associated with an int[] array having the length of the given string. During the iteration, the element at the current index in the array that corresponds to the current character is incremented (that's basically the first phase of Counting sort).

The next step is to iterate over the values of the Map and calculate the cumulative frequency for each array of frequencies, so that every element represents the total number of occurrences of a certain character from the very beginning of the String up to a particular index (this step is identical to the second phase of Counting sort).

Here's how an implementation might look:

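import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Cache of cumulative counts; filled once, for the first string passed to freq()
public static final Map<Character, int[]> FREQ_BY_CHAR = new HashMap<>();

public static int freq(String s, char c, int start, int end) {

    if (FREQ_BY_CHAR.isEmpty()) populate(s);

    int[] frequencies = FREQ_BY_CHAR.get(c);
    if (frequencies == null) return 0; // c never occurs in s

    // frequencies[i] counts occurrences in s[0..i], so subtract the total just before start
    return start == 0 ? frequencies[end] : frequencies[end] - frequencies[start - 1];
}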

public static void populate(String s) {
    countFrequencies(s);
    calculateCumulativeFrequencies();
}

public static void countFrequencies(String s) {
    for (int i = 0; i < s.length(); i++) {
        char next = s.charAt(i);
        int[] frequencies = FREQ_BY_CHAR.computeIfAbsent(next, k -> new int[s.length()]);
        frequencies[i]++;
    }
}

public static void calculateCumulativeFrequencies() {
    FREQ_BY_CHAR.values().forEach(freq -> accumulate(freq));
}

public static void accumulate(int[] freq) {
    for (int i = 1; i < freq.length; i++) freq[i] += freq[i - 1];
}

main()

public static void main(String[] args) {
    String s = "abcnc";
    char find = 'c';
    int start = 1;
    int end = 4;

    System.out.println(freq(s, find, start, end));

    FREQ_BY_CHAR.forEach((ch, arr1) -> System.out.println(ch + " -> " + Arrays.toString(arr1))); 
}

Output:

2
// contents of the Map
//    a  b  c  n  c
a -> [1, 1, 1, 1, 1]
b -> [0, 1, 1, 1, 1]
c -> [0, 0, 1, 1, 2]
n -> [0, 0, 0, 1, 1] 
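
Because FREQ_BY_CHAR is static and only populated while empty, a later call with a different string would read counts built for the first one. One possible way around that, sketched here under assumptions (the class name CharFrequencies is hypothetical, and both start and end are treated as inclusive), is to keep the lazily built table per string:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper: one instance per string, table built on the first query.
final class CharFrequencies {
    private final String s;
    private Map<Character, int[]> cumulative; // null until the first call to freq()

    CharFrequencies(String s) {
        this.s = s;
    }

    // Occurrences of c in s[start..end], both ends inclusive (assumed convention).
    int freq(char c, int start, int end) {
        if (cumulative == null) cumulative = build(s); // O(n) once, O(1) afterwards
        int[] f = cumulative.get(c);
        if (f == null) return 0; // c never occurs in s
        return f[end] - (start == 0 ? 0 : f[start - 1]);
    }

    private static Map<Character, int[]> build(String s) {
        Map<Character, int[]> map = new HashMap<>();
        // Phase 1: mark each occurrence at its index.
        for (int i = 0; i < s.length(); i++) {
            map.computeIfAbsent(s.charAt(i), k -> new int[s.length()])[i]++;
        }
        // Phase 2: running totals from left to right.
        for (int[] f : map.values()) {
            for (int i = 1; i < f.length; i++) f[i] += f[i - 1];
        }
        return map;
    }
}
```

With this shape, new CharFrequencies("abcnc").freq('c', 1, 4) gives 2, and a second instance for another string builds its own table.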


 