Regex pattern and/or NSRegularExpression a bit too slow searching over very large file, can it be optimized?

Question

In an iOS framework, I am searching through this 3.2 MB file for pronunciations: https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/model/lm/en_US/cmu07a.dic

I am using NSRegularExpression to search for an arbitrary set of words that are given as an NSArray. The search is done through the contents of the large file as an NSString. I need to match any word that appears bracketed by a newline and a tab character, and then grab the whole line, for example if I have the word "monday" in my NSArray I want to match this line within the dictionary file:

monday  M AH N D IY

This line starts with a newline, the string "monday" is followed by a tab character, and then the pronunciation follows. The entire line needs to be matched by the regex for its ultimate output. I also need to find alternate pronunciations of the words which are listed as follows:

monday(2)   M AH N D EY

The alternative pronunciations always begin with (2) and can go as high as (5). So I also search for iterations of the word followed by parentheses containing a single number bracketed by a newline and a tab character.

I have a 100% working NSRegularExpression method as follows:

NSArray *array = [NSArray arrayWithObjects:@"friday",@"monday",@"saturday",@"sunday", @"thursday",@"tuesday",@"wednesday",nil]; // This array could contain any arbitrary words but they will always be in alphabetical order by the time they get here.

// Use this string to build up the pattern.
NSMutableString *mutablePatternString = [[NSMutableString alloc]initWithString:@"^("]; 

int firstRound = 0;
for(NSString *word in array) {
    if(firstRound == 0) { // this is the first round

        firstRound++;
    } else { // After the first iteration we need an OR operator first.
        [mutablePatternString appendString:[NSString stringWithFormat:@"|"]];
     }
    [mutablePatternString appendString:[NSString stringWithFormat:@"(%@(\\(.\\)|))",word]];
}

[mutablePatternString appendString:@")\\t.*$"];

// This results in this regex pattern:

// ^((change(\(.\)|))|(friday(\(.\)|))|(monday(\(.\)|))|(saturday(\(.\)|))|(sunday(\(.\)|))|(thursday(\(.\)|))|(tuesday(\(.\)|))|(wednesday(\(.\)|)))\t.*$

NSRegularExpression * regularExpression = [NSRegularExpression regularExpressionWithPattern:mutablePatternString
                                                                                     options:NSRegularExpressionAnchorsMatchLines
                                                                                       error:nil];
int rangeLocation = 0;
int rangeLength = [string length];
NSMutableArray * matches = [NSMutableArray array];
[regularExpression enumerateMatchesInString:string
                                     options:0
                                       range:NSMakeRange(rangeLocation, rangeLength)
                                  usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop){
                                      [matches addObject:[string substringWithRange:result.range]];
                                  }];

[mutablePatternString release];

// matches array is returned to the caller.

My issue is that given the big text file, it isn't really fast enough on the iPhone. 8 words take 1.3 seconds on an iPhone 4, which is too long for the application. Given the following known factors:

• The 3.2 MB text file has the words to match listed in alphabetical order

• The array of arbitrary words to look up are always in alphabetical order when they get to this method

• Alternate pronunciations start with (2) in parens after the word, not (1)

• If there is no (2) there won't be a (3), (4) or more

• The presence of one alternative pronunciation is rare, occurring maybe 1 time in 8 on average. Further alternate pronunciations are even rarer.

Can this method be optimized, either by improving the regex or some aspect of the Objective-C? I'm assuming that NSRegularExpression is already optimized enough that it isn't going to be worthwhile trying to do it with a different Objective-C library or in C, but if I'm wrong here let me know. Otherwise, very grateful for any suggestions on improving the performance. I am hoping to make this generalized to any pronunciation file so I'm trying to stay away from solutions like calculating the alphabetical ranges ahead of time to do more constrained searches.

****EDIT****

Here are the timings on the iPhone 4 for all of the search-related answers given by August 16th 2012:

dasblinkenlight's create NSDictionary approach https://stackoverflow.com/a/11958852/119717 : 5.259676 seconds

Ωmega's fastest regex at https://stackoverflow.com/a/11957535/119717 : 0.609593 seconds

dasblinkenlight's multiple NSRegularExpression approach at https://stackoverflow.com/a/11969602/119717 : 1.255130 seconds

my first hybrid approach at https://stackoverflow.com/a/11970549/119717 : 0.372215 seconds

my second hybrid approach at https://stackoverflow.com/a/11970549/119717 : 0.337549 seconds

The best time so far is the second version of my answer. I can't mark any of the answers best, since all of the search-related answers informed the approach that I took in my version so they are all very helpful and mine is just based on the others. I learned a lot and my method ended up a quarter of the original time so this was enormously helpful, thank you dasblinkenlight and Ωmega for talking it through with me.

Answer 1

Try this one:

^(?:change|monday|tuesday|wednesday|thursday|friday|saturday|sunday)(?:\([2-5]\))?\t.*$

and also this one (using positive lookahead with list of possible first letters):

^(?=[cmtwfs])(?:change|monday|tuesday|wednesday|thursday|friday|saturday|sunday)(?:\([2-5]\))?\t.*$

and at the end, a version with some optimization:

^(?=[cmtwfs])(?:change|monday|t(?:uesday|hursday)|wednesday|friday|s(?:aturday|unday))(?:\([2-5]\))?\t.*$

Answer 2

Since you are putting the entire file into memory anyway, you might as well represent it as a structure that is easy to search:

Create a mutable NSDictionary words , with NSString keys and NSMutableArray values
Read the file into memory
Go through the string representing the file line-by-line
For each line , separate out the word part by searching for a '(' or a '\\t' character
Get a sub-string for the word (from zero to the index of the '(' or '\\t' minus one); this is your key .
Check if the words contains your key ; if it does not, add new NSMutableArray
Add line to the NSMutableArray that you found/created at the specific key
Once your are finished, throw away the original string representing the file.

With this structure in hand, you should be able to do your searches in time that no regex engine would be able to match, because you replaced a full-text scan, which is linear, with a hash look-up, which is constant-time.

** EDIT: ** I checked the relative speed of this solution vs. regex, it is about 60 times faster on a simulator. This is not at all surprising, because the odds are stacked heavily against the regex-based solution.

Reading the file:

NSBundle *bdl = [NSBundle bundleWithIdentifier:@"com.poof-poof.TestAnim"];
NSString *path = [NSString stringWithFormat:@"%@/words_pron.dic", [bdl bundlePath]];
data = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:nil];
NSMutableDictionary *tmp = [NSMutableDictionary dictionary];
NSUInteger pos = 0;
NSMutableCharacterSet *terminator = [NSMutableCharacterSet characterSetWithCharactersInString:@"\t("];
while (pos != data.length) {
    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
        rangeOfCharacterFromSet:[NSCharacterSet newlineCharacterSet]
        options:NSLiteralSearch
        range:remaining
    ];
    if (next.location != NSNotFound) {
        next.length = next.location - pos;
        next.location = pos;
    } else {
        next = remaining;
    }
    pos += (next.length+1);
    NSString *line = [data substringWithRange:next];
    NSRange keyRange = [line rangeOfCharacterFromSet:terminator];
    keyRange.length = keyRange.location;
    keyRange.location = 0;
    NSString *key = [line substringWithRange:keyRange];
    NSMutableArray *array = [tmp objectForKey:key];
    if (!array) {
        array = [NSMutableArray array];
        [tmp setObject:array forKey:key];
    }
    [array addObject:line];
}
dict = tmp; // dict is your NSMutableDictionary ivar

Searching:

NSArray *keys = [NSArray arrayWithObjects:@"sunday", @"monday", @"tuesday", @"wednesday", @"thursday", @"friday", @"saturday", nil];
NSMutableArray *all = [NSMutableArray array];
NSLog(@"Starting...");
for (NSString *key in keys) {
    for (NSString *s in [dict objectForKey:key]) {
        [all addObject:s];
    }
}
NSLog(@"Done! %u", all.count);

Answer 3

Here is my hybrid approach of dasblinkenlight's and Ωmega's answers, which I thought I should add as an answer as well at this point. It uses dasblinkenlight's method of doing a forward search through the string and then performs the full regex on a small range in the event of a hit, so it exploits the fact that the dictionary and words to look up are both in alphabetical order and benefits from the optimized regex. Wish I had two best answer checks to give out! This gives the correct results and takes about half of the time of the pure regex approach on the Simulator (I have to test on the device later to see what the time comparison is on the iPhone 4 which is the reference device):

NSMutableArray *mutableArrayOfWordsToMatch = [[NSMutableArray alloc] initWithArray:array];
NSMutableArray *mutableArrayOfUnfoundWords = [[NSMutableArray alloc] init]; // I also need to know the unfound words.

NSUInteger pos = 0;

NSMutableString *mutablePatternString = [[NSMutableString alloc]initWithString:@"^(?:"];
int firstRound = 0;
for(NSString *word in array) {
    if(firstRound == 0) { // this is the first round

        firstRound++;
    } else { // this is all later rounds
        [mutablePatternString appendString:[NSString stringWithFormat:@"|"]];
    }
    [mutablePatternString appendString:[NSString stringWithFormat:@"%@",word]];
}

[mutablePatternString appendString:@")(?:\\([2-5]\\))?\t.*$"];

// This creates a string that reads "^(?:change|friday|model|monday|quidnunc|saturday|sunday|thursday|tuesday|wednesday)(?:\([2-5]\))?\t.*$"

// We don't want to instantiate the NSRegularExpression in the loop so let's use a pattern that matches everything we're interested in.

NSRegularExpression * regularExpression = [NSRegularExpression regularExpressionWithPattern:mutablePatternString
                                                                                    options:NSRegularExpressionAnchorsMatchLines
                                                                                      error:nil];
NSMutableArray * matches = [NSMutableArray array];

while (pos != data.length) {

    if([mutableArrayOfWordsToMatch count] <= 0) { // If we're at the top of the loop without any more words, stop.
        break;
    }  

    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
                    rangeOfString:[NSString stringWithFormat:@"\n%@\t",[mutableArrayOfWordsToMatch objectAtIndex:0]]
                    options:NSLiteralSearch
                    range:remaining
                    ]; // Just search for the first pronunciation.
    if (next.location != NSNotFound) {

        // If we find the first pronunciation, run the whole regex on a range of {position, 500} only.

        int rangeLocation = next.location;
        int searchPadding = 500;
        int rangeLength = searchPadding;

        if(data.length - next.location < searchPadding) { // Only use 500 if there is 500 more length in the data.
            rangeLength = data.length - next.location;
        } 

        [regularExpression enumerateMatchesInString:data 
                                            options:0
                                              range:NSMakeRange(rangeLocation, rangeLength)
                                         usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop){
                                             [matches addObject:[data substringWithRange:result.range]];
                                         }]; // Grab all the hits at once.

        next.length = next.location - pos;
        next.location = pos;
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove the word.
        pos += (next.length+1);
    } else { // No hits.
        [mutableArrayOfUnfoundWords addObject:[mutableArrayOfWordsToMatch objectAtIndex:0]]; // Add to unfound words.
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove from the word list.
    }
}    

[mutablePatternString release];
[mutableArrayOfUnfoundWords release];
[mutableArrayOfWordsToMatch release];

// return matches to caller

EDIT: here is another version which uses no regex and shaves a little bit more time off of the method:

NSMutableArray *mutableArrayOfWordsToMatch = [[NSMutableArray alloc] initWithArray:array];
NSMutableArray *mutableArrayOfUnfoundWords = [[NSMutableArray alloc] init]; // I also need to know the unfound words.

NSUInteger pos = 0;

NSMutableArray * matches = [NSMutableArray array];

while (pos != data.length) {

    if([mutableArrayOfWordsToMatch count] <= 0) { // If we're at the top of the loop without any more words, stop.
        break;
    }  

    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
                    rangeOfString:[NSString stringWithFormat:@"\n%@\t",[mutableArrayOfWordsToMatch objectAtIndex:0]]
                    options:NSLiteralSearch
                    range:remaining
                    ]; // Just search for the first pronunciation.
    if (next.location != NSNotFound) {
        NSRange lineRange = [data lineRangeForRange:NSMakeRange(next.location+1, next.length)];
        [matches addObject:[data substringWithRange:NSMakeRange(lineRange.location, lineRange.length-1)]]; // Grab the whole line of the hit.
        int rangeLocation = next.location;
        int rangeLength = 750;

        if(data.length - next.location < rangeLength) { // Only use the searchPadding if there is that much room left in the string.
            rangeLength = data.length - next.location;
        } 
        rangeLength = rangeLength/5;
        int newlocation = rangeLocation;

        for(int i = 2;i < 6; i++) { // We really only need to do this from 2-5.
            NSRange morematches = [data
                            rangeOfString:[NSString stringWithFormat:@"\n%@(%d",[mutableArrayOfWordsToMatch objectAtIndex:0],i]
                            options:NSLiteralSearch
                            range:NSMakeRange(newlocation, rangeLength)
                            ];
            if(morematches.location != NSNotFound) {
                NSRange moreMatchesLineRange = [data lineRangeForRange:NSMakeRange(morematches.location+1, morematches.length)]; // Plus one because I don't actually want the line break at the beginning.
                 [matches addObject:[data substringWithRange:NSMakeRange(moreMatchesLineRange.location, moreMatchesLineRange.length-1)]]; // Minus one because I don't actually want the line break at the end.
                newlocation = morematches.location;

            } else {
                break;   
            }
        }

        next.length = next.location - pos;
        next.location = pos;
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove the word.
        pos += (next.length+1);
    } else { // No hits.
        [mutableArrayOfUnfoundWords addObject:[mutableArrayOfWordsToMatch objectAtIndex:0]]; // Add to unfound words.
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove from the word list.
    }
}    

[mutableArrayOfUnfoundWords release];
[mutableArrayOfWordsToMatch release];

Answer 4

Looking at the dictionary file you provided, I'd say that a reasonable strategy could be reading in the data and putting it into any sort of persistent data store.

Read through the file and create objects for each unique word, with n strings of pronunciations (where n is the number of unique pronunciations). The dictionary is already in alphabetical order, so if you parsed it in the order that you're reading it you'd end up with an alphabetical list.

Then you can do a binary search on the data - even with a HUGE number of objects a binary search will find what you're looking for very quickly (assuming alphabetical order).

You could probably even keep the whole thing in memory if you need lightning-fast performance.

Regex pattern and/or NSRegularExpression a bit too slow searching over very large file, can it be optimized?

Question

4 answers

solution1
4 2012-08-14 17:22:22

solution2
4 2012-08-14 18:51:09

solution3
2 2012-08-15 13:52:59

solution4
1 2012-08-14 17:26:18

Regex pattern and/or NSRegularExpression a bit too slow searching over very large file, can it be optimized?

Question

4 answers

solution1 4 2012-08-14 17:22:22

solution2 4 2012-08-14 18:51:09

solution3 2 2012-08-15 13:52:59

solution4 1 2012-08-14 17:26:18

solution1
4 2012-08-14 17:22:22

solution2
4 2012-08-14 18:51:09

solution3
2 2012-08-15 13:52:59

solution4
1 2012-08-14 17:26:18