Regex pattern and/or NSRegularExpression a bit too slow searching over very large file, can it be optimized?

In an iOS framework, I am searching through this 3.2 MB file for pronunciations: https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/model/lm/en_US/cmu07a.dic

I am using NSRegularExpression to search for an arbitrary set of words that are given as an NSArray. The search is done through the contents of the large file as an NSString. I need to match any word that is preceded by a newline and followed by a tab character, and then grab the whole line. For example, if I have the word "monday" in my NSArray, I want to match this line within the dictionary file:

monday  M AH N D IY

This line starts with a newline, the string "monday" is followed by a tab character, and then the pronunciation follows. The entire line needs to be matched by the regex for its ultimate output. I also need to find alternate pronunciations of the words, which are listed as follows:

monday(2)   M AH N D EY

The alternative pronunciations always begin with (2) and can go as high as (5). So I also search for the word followed by parentheses containing a single digit, again preceded by a newline and followed by a tab character.

I have a 100% working NSRegularExpression method as follows:

NSArray *array = [NSArray arrayWithObjects:@"friday",@"monday",@"saturday",@"sunday", @"thursday",@"tuesday",@"wednesday",nil]; // This array could contain any arbitrary words but they will always be in alphabetical order by the time they get here.

// Use this string to build up the pattern.
NSMutableString *mutablePatternString = [[NSMutableString alloc]initWithString:@"^("]; 

int firstRound = 0;
for(NSString *word in array) {
    if(firstRound == 0) { // this is the first round

        firstRound++;
    } else { // After the first iteration we need an OR operator first.
        [mutablePatternString appendString:[NSString stringWithFormat:@"|"]];
     }
    [mutablePatternString appendString:[NSString stringWithFormat:@"(%@(\\(.\\)|))",word]];
}

[mutablePatternString appendString:@")\\t.*$"];

// With @"change" also in the array, this results in this regex pattern:

// ^((change(\(.\)|))|(friday(\(.\)|))|(monday(\(.\)|))|(saturday(\(.\)|))|(sunday(\(.\)|))|(thursday(\(.\)|))|(tuesday(\(.\)|))|(wednesday(\(.\)|)))\t.*$

NSRegularExpression * regularExpression = [NSRegularExpression regularExpressionWithPattern:mutablePatternString
                                                                                     options:NSRegularExpressionAnchorsMatchLines
                                                                                       error:nil];
int rangeLocation = 0;
int rangeLength = [string length];
NSMutableArray * matches = [NSMutableArray array];
[regularExpression enumerateMatchesInString:string
                                     options:0
                                       range:NSMakeRange(rangeLocation, rangeLength)
                                  usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop){
                                      [matches addObject:[string substringWithRange:result.range]];
                                  }];

[mutablePatternString release];

// matches array is returned to the caller.

My issue is that given the big text file, it isn't really fast enough on the iPhone. 8 words take 1.3 seconds on an iPhone 4, which is too long for the application. Given the following known factors:

• The 3.2 MB text file has the words to match listed in alphabetical order

• The array of arbitrary words to look up is always in alphabetical order when it gets to this method

• Alternate pronunciations start with (2) in parens after the word, not (1)

• If there is no (2) there won't be a (3), (4) or more

• The presence of one alternative pronunciation is rare, occurring maybe 1 time in 8 on average. Further alternate pronunciations are even rarer.

Can this method be optimized, either by improving the regex or some aspect of the Objective-C? I'm assuming that NSRegularExpression is already optimized enough that it isn't going to be worthwhile trying to do it with a different Objective-C library or in C, but if I'm wrong here let me know. Otherwise, I'd be very grateful for any suggestions on improving the performance. I am hoping to make this generalized to any pronunciation file, so I'm trying to stay away from solutions like calculating the alphabetical ranges ahead of time to do more constrained searches.

EDIT:

Here are the timings on the iPhone 4 for all of the search-related answers given by August 16th 2012:

dasblinkenlight's create NSDictionary approach at https://stackoverflow.com/a/11958852/119717 : 5.259676 seconds

Ωmega's fastest regex at https://stackoverflow.com/a/11957535/119717 : 0.609593 seconds

dasblinkenlight's multiple NSRegularExpression approach at https://stackoverflow.com/a/11969602/119717 : 1.255130 seconds

my first hybrid approach at https://stackoverflow.com/a/11970549/119717 : 0.372215 seconds

my second hybrid approach at https://stackoverflow.com/a/11970549/119717 : 0.337549 seconds

The best time so far is the second version of my answer. I can't mark any of the answers best, since all of the search-related answers informed the approach that I took in my version, so they are all very helpful and mine is just based on the others. I learned a lot and my method ended up taking about a quarter of the original time, so this was enormously helpful. Thank you dasblinkenlight and Ωmega for talking it through with me.

Try this one:

^(?:change|monday|tuesday|wednesday|thursday|friday|saturday|sunday)(?:\([2-5]\))?\t.*$

and also this one (using positive lookahead with a list of possible first letters):

^(?=[cmtwfs])(?:change|monday|tuesday|wednesday|thursday|friday|saturday|sunday)(?:\([2-5]\))?\t.*$

and finally, a version with some further optimization (common prefixes factored out):

^(?=[cmtwfs])(?:change|monday|t(?:uesday|hursday)|wednesday|friday|s(?:aturday|unday))(?:\([2-5]\))?\t.*$
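
Since the word list is arbitrary, the first of these patterns would have to be generated at run time. The following is a minimal sketch of how that might look, not the answerer's own code: the function name MatchingPronunciationLines is hypothetical, and it assumes the whole dictionary is already in memory as an NSString. It escapes each word with escapedPatternForString: so that any regex metacharacters in a word are treated literally, and it leaves out the lookahead and prefix-factoring variants for brevity.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original answer): returns every line of
// `dictionary` whose word part is one of `words`, optionally followed by an
// alternate-pronunciation marker (2) to (5), then a tab.
static NSArray *MatchingPronunciationLines(NSString *dictionary, NSArray *words) {
    if ([words count] == 0) {
        return [NSArray array]; // an empty alternation would match unrelated lines
    }

    // Escape each word so characters such as '.' are matched literally.
    NSMutableArray *escaped = [NSMutableArray arrayWithCapacity:[words count]];
    for (NSString *word in words) {
        [escaped addObject:[NSRegularExpression escapedPatternForString:word]];
    }

    // Non-capturing groups and an explicit [2-5] digit class, as suggested above.
    NSString *pattern = [NSString stringWithFormat:@"^(?:%@)(?:\\([2-5]\\))?\\t.*$",
                         [escaped componentsJoinedByString:@"|"]];

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionAnchorsMatchLines
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Could not compile pattern: %@", error);
        return [NSArray array];
    }

    NSMutableArray *lines = [NSMutableArray array];
    [regex enumerateMatchesInString:dictionary
                            options:0
                              range:NSMakeRange(0, [dictionary length])
                         usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop) {
                             [lines addObject:[dictionary substringWithRange:[result range]]];
                         }];
    return lines;
}
```

Whether the lookahead or the factored prefixes help further depends on the regex engine and the word list, so each variant is worth timing on the target device.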

Since you are putting the entire file into memory anyway, you might as well represent it as a structure that is easy to search:

  • Create an NSMutableDictionary words, with NSString keys and NSMutableArray values
  • Read the file into memory
  • Go through the string representing the file line-by-line
  • For each line, separate out the word part by searching for a '(' or a '\t' character
  • Get a sub-string for the word (from zero to the index of the '(' or '\t' minus one); this is your key.
  • Check if words contains your key; if it does not, add a new NSMutableArray
  • Add line to the NSMutableArray that you found/created at the specific key
  • Once you are finished, throw away the original string representing the file.

With this structure in hand, your searches should be faster than any regex engine could manage, because you have replaced a full-text scan, which is linear in the size of the file, with a hash look-up, which is constant-time on average.

EDIT: I checked the relative speed of this solution vs. regex; it is about 60 times faster on a simulator. This is not at all surprising, because the odds are stacked heavily against the regex-based solution.

Reading the file:

NSBundle *bdl = [NSBundle bundleWithIdentifier:@"com.poof-poof.TestAnim"];
NSString *path = [NSString stringWithFormat:@"%@/words_pron.dic", [bdl bundlePath]];
data = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:nil]; // data is your NSString ivar
NSMutableDictionary *tmp = [NSMutableDictionary dictionary];
NSUInteger pos = 0;
// The word part of a line ends at the first tab or opening parenthesis.
NSCharacterSet *terminator = [NSCharacterSet characterSetWithCharactersInString:@"\t("];
// Use "<" rather than "!=": after an unterminated last line, pos ends up one past data.length.
while (pos < data.length) {
    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
        rangeOfCharacterFromSet:[NSCharacterSet newlineCharacterSet]
        options:NSLiteralSearch
        range:remaining
    ];
    if (next.location != NSNotFound) {
        // Turn the newline's range into the range of the line that precedes it.
        next.length = next.location - pos;
        next.location = pos;
    } else {
        next = remaining; // last line, no trailing newline
    }
    pos += (next.length+1);
    NSString *line = [data substringWithRange:next];
    NSRange keyRange = [line rangeOfCharacterFromSet:terminator];
    if (keyRange.location == NSNotFound) {
        continue; // blank or malformed line: there is no word to index
    }
    NSString *key = [line substringToIndex:keyRange.location];
    NSMutableArray *array = [tmp objectForKey:key];
    if (!array) {
        array = [NSMutableArray array];
        [tmp setObject:array forKey:key];
    }
    [array addObject:line];
}
dict = tmp; // dict is your NSMutableDictionary ivar

Searching:

NSArray *keys = [NSArray arrayWithObjects:@"sunday", @"monday", @"tuesday", @"wednesday", @"thursday", @"friday", @"saturday", nil];
NSMutableArray *all = [NSMutableArray array];
NSLog(@"Starting...");
for (NSString *key in keys) {
    for (NSString *s in [dict objectForKey:key]) {
        [all addObject:s];
    }
}
NSLog(@"Done! %lu", (unsigned long)all.count);
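
If the caller also needs to know which words had no entry at all (a requirement that comes up in a later answer on this page), the same look-up loop can collect them as it goes. This is only a sketch under that assumption: the function name LookUpPronunciations and its parameters are hypothetical, and the index argument is expected to have the shape built above (word keys, arrays of whole lines as values).

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: `index` maps each word to an NSArray of its full
// dictionary lines. Words without an entry are appended to `unfound`.
static NSArray *LookUpPronunciations(NSDictionary *index, NSArray *words, NSMutableArray *unfound) {
    NSMutableArray *lines = [NSMutableArray array];
    for (NSString *word in words) {
        NSArray *entries = [index objectForKey:word];
        if (entries != nil) {
            // Base pronunciation plus any (2)..(5) alternates, in file order.
            [lines addObjectsFromArray:entries];
        } else {
            [unfound addObject:word];
        }
    }
    return lines;
}
```

Each word costs one hash look-up, so the time grows with the number of words requested rather than with the size of the dictionary file. The cost of building the index is paid once, up front.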

Here is my hybrid approach of dasblinkenlight's and Ωmega's answers, which I thought I should add as an answer as well at this point. It uses dasblinkenlight's method of doing a forward search through the string and then, in the event of a hit, performs the full regex on a small range. It therefore exploits the fact that the dictionary and the words to look up are both in alphabetical order, and it benefits from the optimized regex. Wish I had two best answer checks to give out! This gives the correct results and takes about half of the time of the pure regex approach on the Simulator (I have to test on the device later to see what the time comparison is on the iPhone 4, which is the reference device):

NSMutableArray *mutableArrayOfWordsToMatch = [[NSMutableArray alloc] initWithArray:array];
NSMutableArray *mutableArrayOfUnfoundWords = [[NSMutableArray alloc] init]; // I also need to know the unfound words.

NSUInteger pos = 0;

NSMutableString *mutablePatternString = [[NSMutableString alloc]initWithString:@"^(?:"];
int firstRound = 0;
for(NSString *word in array) {
    if(firstRound == 0) { // this is the first round

        firstRound++;
    } else { // this is all later rounds
        [mutablePatternString appendString:[NSString stringWithFormat:@"|"]];
    }
    [mutablePatternString appendString:[NSString stringWithFormat:@"%@",word]];
}

[mutablePatternString appendString:@")(?:\\([2-5]\\))?\t.*$"];

// This creates a string that reads "^(?:change|friday|model|monday|quidnunc|saturday|sunday|thursday|tuesday|wednesday)(?:\([2-5]\))?\t.*$"

// We don't want to instantiate the NSRegularExpression in the loop so let's use a pattern that matches everything we're interested in.

NSRegularExpression * regularExpression = [NSRegularExpression regularExpressionWithPattern:mutablePatternString
                                                                                    options:NSRegularExpressionAnchorsMatchLines
                                                                                      error:nil];
NSMutableArray * matches = [NSMutableArray array];

while (pos != data.length) {

    if([mutableArrayOfWordsToMatch count] <= 0) { // If we're at the top of the loop without any more words, stop.
        break;
    }  

    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
                    rangeOfString:[NSString stringWithFormat:@"\n%@\t",[mutableArrayOfWordsToMatch objectAtIndex:0]]
                    options:NSLiteralSearch
                    range:remaining
                    ]; // Just search for the first pronunciation.
    if (next.location != NSNotFound) {

        // If we find the first pronunciation, run the whole regex on a range of {position, 500} only.

        int rangeLocation = next.location;
        int searchPadding = 500;
        int rangeLength = searchPadding;

        if(data.length - next.location < searchPadding) { // Only use 500 if there is 500 more length in the data.
            rangeLength = data.length - next.location;
        } 

        [regularExpression enumerateMatchesInString:data 
                                            options:0
                                              range:NSMakeRange(rangeLocation, rangeLength)
                                         usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop){
                                             [matches addObject:[data substringWithRange:result.range]];
                                         }]; // Grab all the hits at once.

        next.length = next.location - pos;
        next.location = pos;
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove the word.
        pos += (next.length+1);
    } else { // No hits.
        [mutableArrayOfUnfoundWords addObject:[mutableArrayOfWordsToMatch objectAtIndex:0]]; // Add to unfound words.
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove from the word list.
    }
}    

[mutablePatternString release];
[mutableArrayOfUnfoundWords release];
[mutableArrayOfWordsToMatch release];

// return matches to caller

EDIT: here is another version which uses no regex and shaves a little bit more time off of the method:

NSMutableArray *mutableArrayOfWordsToMatch = [[NSMutableArray alloc] initWithArray:array];
NSMutableArray *mutableArrayOfUnfoundWords = [[NSMutableArray alloc] init]; // I also need to know the unfound words.

NSUInteger pos = 0;

NSMutableArray * matches = [NSMutableArray array];

while (pos != data.length) {

    if([mutableArrayOfWordsToMatch count] <= 0) { // If we're at the top of the loop without any more words, stop.
        break;
    }  

    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
                    rangeOfString:[NSString stringWithFormat:@"\n%@\t",[mutableArrayOfWordsToMatch objectAtIndex:0]]
                    options:NSLiteralSearch
                    range:remaining
                    ]; // Just search for the first pronunciation.
    if (next.location != NSNotFound) {
        NSRange lineRange = [data lineRangeForRange:NSMakeRange(next.location+1, next.length)];
        [matches addObject:[data substringWithRange:NSMakeRange(lineRange.location, lineRange.length-1)]]; // Grab the whole line of the hit.
        int rangeLocation = next.location;
        int rangeLength = 750;

        if(data.length - next.location < rangeLength) { // Only use the searchPadding if there is that much room left in the string.
            rangeLength = data.length - next.location;
        } 
        rangeLength = rangeLength/5;
        int newlocation = rangeLocation;

        for(int i = 2;i < 6; i++) { // We really only need to do this from 2-5.
            NSRange morematches = [data
                            rangeOfString:[NSString stringWithFormat:@"\n%@(%d",[mutableArrayOfWordsToMatch objectAtIndex:0],i]
                            options:NSLiteralSearch
                            range:NSMakeRange(newlocation, rangeLength)
                            ];
            if(morematches.location != NSNotFound) {
                NSRange moreMatchesLineRange = [data lineRangeForRange:NSMakeRange(morematches.location+1, morematches.length)]; // Plus one because I don't actually want the line break at the beginning.
                 [matches addObject:[data substringWithRange:NSMakeRange(moreMatchesLineRange.location, moreMatchesLineRange.length-1)]]; // Minus one because I don't actually want the line break at the end.
                newlocation = morematches.location;

            } else {
                break;   
            }
        }

        next.length = next.location - pos;
        next.location = pos;
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove the word.
        pos += (next.length+1);
    } else { // No hits.
        [mutableArrayOfUnfoundWords addObject:[mutableArrayOfWordsToMatch objectAtIndex:0]]; // Add to unfound words.
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove from the word list.
    }
}    

[mutableArrayOfUnfoundWords release];
[mutableArrayOfWordsToMatch release];

Looking at the dictionary file you provided, I'd say that a reasonable strategy could be reading in the data and putting it into any sort of persistent data store.

Read through the file and create objects for each unique word, with n strings of pronunciations (where n is the number of unique pronunciations). The dictionary is already in alphabetical order, so if you parsed it in the order that you're reading it you'd end up with an alphabetical list.

Then you can do a binary search on the data. Even with a HUGE number of objects, a binary search will find what you're looking for very quickly (assuming alphabetical order).

You could probably even keep the whole thing in memory if you need lightning-fast performance.
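
As a rough illustration of the binary search step, here is a minimal sketch. The function name IndexOfWord is hypothetical, and the code assumes the unique words have already been parsed into an NSArray of NSString keys that is in ascending order under a literal -compare:. If the file's own ordering does not match that comparison, the keys would need to be sorted once after parsing.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the index of `word` in `sortedKeys`,
// or NSNotFound if it is absent.
// Assumption: sortedKeys holds unique NSStrings in ascending order under
// -compare:options: with NSLiteralSearch.
static NSUInteger IndexOfWord(NSArray *sortedKeys, NSString *word) {
    NSUInteger low = 0;
    NSUInteger high = [sortedKeys count]; // search the half-open range [low, high)
    while (low < high) {
        NSUInteger mid = low + (high - low) / 2;
        NSComparisonResult order = [[sortedKeys objectAtIndex:mid] compare:word
                                                                   options:NSLiteralSearch];
        if (order == NSOrderedSame) {
            return mid;
        } else if (order == NSOrderedAscending) {
            low = mid + 1; // key at mid sorts before word: look in the upper half
        } else {
            high = mid;    // key at mid sorts after word: look in the lower half
        }
    }
    return NSNotFound;
}
```

The returned index can then be used to fetch the pronunciations from a parallel array or from the parsed objects themselves. Each look-up needs on the order of log2(n) comparisons for n unique words.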
