
Is there any way to make this a faster algorithm?

If I understand Big O notation, and believe me my understanding at this point is probably much lower than most, I thought the following line of code was O(n²); per the comment by Keyser it is in fact already an O(n) operation:

"Hello, World!".ToLower().Contains("a");

because ToLower() is an O(n) operation and Contains is as well. Maybe it's O(n + n); again, my understanding is still fuzzy.

NOTE: the test methods below were run in a Release build and use the Stopwatch class to track run time.

However, I'd like to make it faster, and so consider these three test methods:

// Baseline: lower-case the whole string, then scan it once with Contains.
private static void TestToLower(int i)
{
    var s = "".PadRight(i, 'A');

    var sw = Stopwatch.StartNew();
    s.ToLower().Contains('b');
    sw.Stop();

    _tests.Add(string.Format("ToLower{0}", i), sw.ElapsedMilliseconds);
}

// Build a HashSet from the lowered characters (streamed via AsEnumerable), then look up.
private static void TestHashSet(int i)
{
    var s = "".PadRight(i, 'A');

    var sw = Stopwatch.StartNew();
    var lookup = new HashSet<char>(s.ToLower().AsEnumerable());
    lookup.Contains('b');
    sw.Stop();

    _tests.Add(string.Format("ToHashSet{0}", i), sw.ElapsedMilliseconds);
}

// Same as TestHashSet, but materialize the characters with ToArray() first.
private static void TestHashSet2(int i)
{
    var s = "".PadRight(i, 'A');

    var sw = Stopwatch.StartNew();
    var lookup = new HashSet<char>(s.ToLower().ToArray());
    lookup.Contains('b');
    sw.Stop();

    _tests.Add(string.Format("ToHashSet2{0}", i), sw.ElapsedMilliseconds);
}
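
(These methods record their timings in a _tests collection whose declaration the post doesn't show; a minimal assumption that matches the calls above would be:

private static readonly Dictionary<string, long> _tests = new Dictionary<string, long>();

since Stopwatch.ElapsedMilliseconds is a long.)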

Now consider executing those like this:

TestToLower(1000000);
TestToLower(2000000);
TestToLower(4000000);

TestHashSet(1000000);
TestHashSet(2000000);
TestHashSet(4000000);

TestHashSet2(1000000);
TestHashSet2(2000000);
TestHashSet2(4000000);

the results are as follows:

ToLower    1000000:  22.00 ms
ToLower    2000000:  40.00 ms
ToLower    4000000:  84.00 ms
ToHashSet  1000000:  48.00 ms
ToHashSet  2000000:  73.00 ms
ToHashSet  4000000: 145.00 ms
ToHashSet2 1000000:  58.00 ms
ToHashSet2 2000000: 107.00 ms
ToHashSet2 4000000: 219.00 ms

Each of them clearly still has to use the ToLower method, but I'm attempting to use the HashSet to make the lookup faster. Ideally you wouldn't have to scan the entire string. Further, I really thought the second test, TestHashSet, would be faster than the third, because it streams the characters with AsEnumerable() rather than allocating a large intermediate array with ToArray() before building the HashSet.

In retrospect I think I see why the last two methods are slower: they use the same algorithm as the first (I still have to go through the entire string at least twice), and then on top of that they do the hash lookup.

How can I make this algorithm faster? We use this a lot, where we have to compare strings regardless of case.

No offense intended, but you don't understand big-O. O(n + n) is the same as O(n). The whole point of big-O is to "hide" constant factors. You can't do better than O(n) with one processor on this problem. You might get O(n/k) on k cores by splitting the string into k pieces and searching them with separate threads.
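
A minimal sketch of that multi-core idea (my own illustration, not code from the answer; the method name is made up, and in practice Parallel.For has per-iteration overhead, so this only pays off for very long strings):

// Requires using System.Threading.Tasks;
private static bool ContainsCharParallel(string str, char charToFind)
{
    char upper = char.ToUpper(charToFind);
    char lower = char.ToLower(charToFind);
    bool found = false;

    // Worker threads each scan a slice of the index range;
    // state.Stop() ends the whole loop early once any match is found.
    Parallel.For(0, str.Length, (i, state) =>
    {
        if (str[i] == upper || str[i] == lower)
        {
            found = true;   // benign race: the flag is only ever set to true
            state.Stop();
        }
    });

    return found;
}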

Converting a character to lower case is a constant time operation. Checking for a match with a desired character is a cheap constant time operation. Inserting a character in a hash set is a fairly expensive constant time operation. In your hash set tests, you have added this rather large constant cost to the handling of each character. Since it is larger than the constant cost of merely looking at the character to see if it matches the pattern string, your run times get longer.

Using a hash set for lookup makes sense only if you're looking up many values. If you need to do multiple lookups on the same string to see if it contains any or all of k different characters, then you will probably benefit by building the hash set, because k lookups will take O(k) time rather than the O(kn) time to scan the whole string for each character. A sketch of this follows.
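
For example (a minimal sketch: the set is built once in O(n), after which each membership test is O(1)):

// Requires using System.Collections.Generic;
// a string enumerates its chars, so it can seed the HashSet directly.
var lookup = new HashSet<char>("Hello, World!".ToLower());

bool hasA = lookup.Contains('a');   // false
bool hasW = lookup.Contains('w');   // true
bool hasZ = lookup.Contains('z');   // false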

If you are looking for only one character in each string, forget big-O. Constant factors are your best hope. You should consider a low-level loop. It would go something like this:

static bool FindChar(string str, char charToFind)
{
    char upper = char.ToUpper(charToFind);
    char lower = char.ToLower(charToFind);

    for (int i = 0; i < str.Length; i++)
    {
        // Compare against both cases; no new string is ever allocated.
        if (str[i] == upper || str[i] == lower)
        {
            return true;
        }
    }
    return false;
}

Note this scans the string at most once and stops as soon as the character is found, so the expected number of characters checked is half those in the string. This function also generates no garbage.
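
For instance, assuming the method above:

bool found = FindChar("Hello, World!", 'w');   // true: matches the capital 'W' without allocating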

On the other hand, the expected number of characters touched by

str.ToLower().Contains("a");

is 1.5 times the length of str: ToLower touches all n characters, and Contains then scans an expected n/2 more. Garbage will be generated as well. So you might win with the explicit loop.

If this is still too slow, a native function might produce a small gain. You'd have to try it to find out.

I believe your code is O(2n) = O(n). That's because each call traverses the input string twice. To reduce the algorithmic bound on your running time, you would need a sublinear algorithm, one with a logarithmic bound or O(n^k) with k < 1, which I believe is impossible in your scenario. The best I can suggest is to utilize invariant-specific information: for example, if you know that your strings always have only the first letter in uppercase, only change the first character in the string. This is an example of how you could exploit domain-specific knowledge, as sketched below.
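
A minimal sketch of that invariant idea (my own illustration; it assumes strings where only the first character can ever be uppercase):

// Assumes the invariant holds: only the first character may be uppercase.
static bool ContainsWithFirstUpperInvariant(string s, string needle)
{
    if (s.Length == 0)
        return s.Contains(needle);

    // Normalize just the first character instead of calling ToLower on all n.
    string normalized = char.ToLower(s[0]) + s.Substring(1);
    return normalized.Contains(needle);
}

This still allocates one copy of the string, but it skips the per-character case conversion over the rest of it.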
