Regular expression pattern and/or NSRegularExpression a bit too slow searching over a very large file; can it be optimized?

Question

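In an iOS framework, I am searching through this 3.2 MB file for pronunciations: https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/model/lm/en_US/cmu07a.dic

I am using NSRegularExpression to search for an arbitrary set of words that are given as an NSArray. The search is done through the contents of the large file as an NSString. I need to match any word that appears bracketed by a newline and a tab character, and then grab the whole line. For example, if I have the word "monday" in my NSArray, I want to match this line in the dictionary file: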

monday  M AH N D IY

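This line starts with a newline, the string "monday" is followed by a tab character, and then the pronunciation follows. The entire line needs to be matched by the regex for its final output. I also need to find alternate pronunciations of the words, which are listed as follows: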

monday(2)   M AH N D EY

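The alternate pronunciations always begin with (2) and can go as high as (5). So I also search for iterations of the word followed by parentheses containing a single digit, bracketed by a newline and a tab character.

I have a 100% working NSRegularExpression method as follows: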

NSArray *array = [NSArray arrayWithObjects:@"change",@"friday",@"monday",@"saturday",@"sunday", @"thursday",@"tuesday",@"wednesday",nil]; // This array could contain any arbitrary words but they will always be in alphabetical order by the time they get here.

// Use this string to build up the pattern.
NSMutableString *mutablePatternString = [[NSMutableString alloc]initWithString:@"^("]; 

int firstRound = 0;
for(NSString *word in array) {
    if(firstRound == 0) { // this is the first round

        firstRound++;
    } else { // After the first iteration we need an OR operator first.
        [mutablePatternString appendString:[NSString stringWithFormat:@"|"]];
     }
    [mutablePatternString appendString:[NSString stringWithFormat:@"(%@(\\(.\\)|))",word]];
}

[mutablePatternString appendString:@")\\t.*$"];

// This results in this regex pattern:

// ^((change(\(.\)|))|(friday(\(.\)|))|(monday(\(.\)|))|(saturday(\(.\)|))|(sunday(\(.\)|))|(thursday(\(.\)|))|(tuesday(\(.\)|))|(wednesday(\(.\)|)))\t.*$

NSRegularExpression * regularExpression = [NSRegularExpression regularExpressionWithPattern:mutablePatternString
                                                                                     options:NSRegularExpressionAnchorsMatchLines
                                                                                       error:nil];
int rangeLocation = 0;
int rangeLength = [string length];
NSMutableArray * matches = [NSMutableArray array];
[regularExpression enumerateMatchesInString:string
                                     options:0
                                       range:NSMakeRange(rangeLocation, rangeLength)
                                  usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop){
                                      [matches addObject:[string substringWithRange:result.range]];
                                  }];

[mutablePatternString release];

// matches array is returned to the caller.

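My problem is that, given the big text file, this is not fast enough on the iPhone. 8 words take 1.3 seconds on an iPhone 4, which is too long for the application. Given the following known factors:

• The 3.2 MB text file has the words to match listed in alphabetical order

• The array of arbitrary words to look up is always in alphabetical order when it gets to this method

• Alternate pronunciations start with (2) in parentheses after the word, not (1)

• If there is no (2), there will not be a (3), (4) or more

• The presence of one alternate pronunciation is rare, occurring on average 1 time in 8. Further alternate pronunciations are even rarer.

Can this method be optimized, either by improving the regex or some aspect of the Objective-C? I am assuming that NSRegularExpression is already optimized enough that it is not worthwhile trying to do it with a different Objective-C library or in C, but if I am wrong here, let me know. Otherwise, I would be very grateful for any suggestions on improving the performance. I am hoping to generalize this to any pronunciation file, so I am trying to stay away from solutions such as calculating the alphabetical ranges ahead of time to do more constrained searches.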

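**** EDIT ****

Here are the timings on the iPhone 4 for all of the search-related answers given as of August 16th, 2012:

dasblinkenlight's create-an-NSDictionary approach, https://stackoverflow.com/a/11958852/119717: 5.259676 seconds

Ωmega's fastest regex, at https://stackoverflow.com/a/11957535/119717: 0.609593 seconds

dasblinkenlight's multiple-NSRegularExpression approach, at https://stackoverflow.com/a/11969602/119717: 1.255130 seconds

My first hybrid approach, at https://stackoverflow.com/a/11970549/119717: 0.372215 seconds

My second hybrid approach, at https://stackoverflow.com/a/11970549/119717: 0.337549 seconds

The best time so far is the second version of my answer. I cannot mark any of the answers as best, since all of the search-related answers informed the approach I took in my version, so they were all very helpful and mine is just based on the others. I learned a lot, and my method ended up at a quarter of the original time, so this was enormously helpful. Thank you, dasblinkenlight and Ωmega, for talking it through with me.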

Answer 1

Try this:

^(?:change|monday|tuesday|wednesday|thursday|friday|saturday|sunday)(?:\([2-5]\))?\t.*$

And also this one (using a positive lookahead with the list of possible first letters):

^(?=[cmtwfs])(?:change|monday|tuesday|wednesday|thursday|friday|saturday|sunday)(?:\([2-5]\))?\t.*$

And finally, a version with some optimization (words that share a first letter are factored into one group):

^(?=[cmtwfs])(?:change|monday|t(?:uesday|hursday)|wednesday|friday|s(?:aturday|unday))(?:\([2-5]\))?\t.*$
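
For an arbitrary word list, the first pattern above could be assembled along these lines. This is only a sketch: the helper name PronunciationLines is made up for illustration, and escaping each word with +escapedPatternForString: is an added precaution for dictionary entries that contain regex metacharacters, not something the patterns above do. It uses only autoreleased objects, so it should behave the same with or without ARC.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the answer): builds
// ^(?:w1|w2|...)(?:\([2-5]\))?\t.*$ from `words` and returns every
// matching line of `text`.
static NSArray *PronunciationLines(NSArray *words, NSString *text) {
    if ([words count] == 0) {
        return [NSArray array]; // An empty alternation would match unrelated lines.
    }

    // Escape each word so that characters such as '.' are matched literally.
    NSMutableArray *escaped = [NSMutableArray arrayWithCapacity:[words count]];
    for (NSString *word in words) {
        [escaped addObject:[NSRegularExpression escapedPatternForString:word]];
    }

    NSString *pattern =
        [NSString stringWithFormat:@"^(?:%@)(?:\\([2-5]\\))?\\t.*$",
                                   [escaped componentsJoinedByString:@"|"]];

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionAnchorsMatchLines
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Could not compile pattern: %@", error);
        return [NSArray array];
    }

    // Collect the full text of each matching line.
    NSMutableArray *lines = [NSMutableArray array];
    NSRange whole = NSMakeRange(0, [text length]);
    for (NSTextCheckingResult *match in [regex matchesInString:text options:0 range:whole]) {
        [lines addObject:[text substringWithRange:[match range]]];
    }
    return lines;
}
```

Calling PronunciationLines with the word array and the file contents returns the matched lines in file order. The lookahead and prefix-factoring variants would need extra code to compute the first-letter set and the shared prefixes.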

Answer 2

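Since you are putting the entire file into memory anyway, you might as well represent it as a structure that is easy to search:

• Create a mutable NSDictionary words, with NSString keys and NSMutableArray values
• Read the file into memory
• Go through the string representing the file line by line
• For each line, separate out the word part by searching for a '(' or a '\t' character
• Get the substring for the word (from index zero to the index of the '(' or '\t' minus one); this is your key.
• Check whether words contains your key; if it does not, add a new NSMutableArray
• Add the line to the NSMutableArray that you found or created for that key
• Once you are finished, throw away the original string representing the file.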

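With this structure in hand, you should be able to do your searches in a time that no regex engine can match, because you have replaced a full-text scan, which is linear, with a hash lookup, which is constant-time.

**EDIT:** I checked the relative speed of this solution against the regex, and it is about 60 times faster on the simulator. This is not at all surprising, because the odds are stacked heavily against the regex-based solution.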

Reading the file:

NSBundle *bdl = [NSBundle bundleWithIdentifier:@"com.poof-poof.TestAnim"];
NSString *path = [NSString stringWithFormat:@"%@/words_pron.dic", [bdl bundlePath]];
data = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:nil];
NSMutableDictionary *tmp = [NSMutableDictionary dictionary];
NSUInteger pos = 0;
NSMutableCharacterSet *terminator = [NSMutableCharacterSet characterSetWithCharactersInString:@"\t("];
while (pos < data.length) { // '<' rather than '!=': pos overshoots by one if the file does not end with a newline.
    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
        rangeOfCharacterFromSet:[NSCharacterSet newlineCharacterSet]
        options:NSLiteralSearch
        range:remaining
    ];
    if (next.location != NSNotFound) {
        next.length = next.location - pos;
        next.location = pos;
    } else {
        next = remaining;
    }
    pos += (next.length+1);
    NSString *line = [data substringWithRange:next];
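    NSRange keyRange = [line rangeOfCharacterFromSet:terminator];
    if (keyRange.location == NSNotFound) {
        continue; // Skip blank or malformed lines that contain neither a tab nor '('.
    }
    keyRange.length = keyRange.location;
    keyRange.location = 0;
    NSString *key = [line substringWithRange:keyRange];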
    NSMutableArray *array = [tmp objectForKey:key];
    if (!array) {
        array = [NSMutableArray array];
        [tmp setObject:array forKey:key];
    }
    [array addObject:line];
}
dict = tmp; // dict is your NSMutableDictionary ivar

Searching:

NSArray *keys = [NSArray arrayWithObjects:@"sunday", @"monday", @"tuesday", @"wednesday", @"thursday", @"friday", @"saturday", nil];
NSMutableArray *all = [NSMutableArray array];
NSLog(@"Starting...");
for (NSString *key in keys) {
    for (NSString *s in [dict objectForKey:key]) {
        [all addObject:s];
    }
}
NSLog(@"Done! %lu", (unsigned long)all.count);

Answer 3

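Here is my hybrid of dasblinkenlight's and Ωmega's answers, which I thought I should add as an answer as well at this point. It uses dasblinkenlight's method of doing a forward search through the string and then runs the full regex on a small range in the event of a hit, so it exploits the fact that the dictionary and the words to look up are both in alphabetical order, and it benefits from the optimized regex. I wish I had two best-answer checks to give out! This gives the correct results and takes about half the time of the pure regex approach on the Simulator (I have to test on the device later to see what the time comparison is on the iPhone 4, which is the reference device):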

NSMutableArray *mutableArrayOfWordsToMatch = [[NSMutableArray alloc] initWithArray:array];
NSMutableArray *mutableArrayOfUnfoundWords = [[NSMutableArray alloc] init]; // I also need to know the unfound words.

NSUInteger pos = 0;

NSMutableString *mutablePatternString = [[NSMutableString alloc]initWithString:@"^(?:"];
int firstRound = 0;
for(NSString *word in array) {
    if(firstRound == 0) { // this is the first round

        firstRound++;
    } else { // this is all later rounds
        [mutablePatternString appendString:[NSString stringWithFormat:@"|"]];
    }
    [mutablePatternString appendString:[NSString stringWithFormat:@"%@",word]];
}

[mutablePatternString appendString:@")(?:\\([2-5]\\))?\t.*$"];

// This creates a string that reads "^(?:change|friday|model|monday|quidnunc|saturday|sunday|thursday|tuesday|wednesday)(?:\([2-5]\))?\t.*$"

// We don't want to instantiate the NSRegularExpression in the loop so let's use a pattern that matches everything we're interested in.

NSRegularExpression * regularExpression = [NSRegularExpression regularExpressionWithPattern:mutablePatternString
                                                                                    options:NSRegularExpressionAnchorsMatchLines
                                                                                      error:nil];
NSMutableArray * matches = [NSMutableArray array];

while (pos != data.length) {

    if([mutableArrayOfWordsToMatch count] <= 0) { // If we're at the top of the loop without any more words, stop.
        break;
    }  

    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
                    rangeOfString:[NSString stringWithFormat:@"\n%@\t",[mutableArrayOfWordsToMatch objectAtIndex:0]]
                    options:NSLiteralSearch
                    range:remaining
                    ]; // Just search for the first pronunciation.
    if (next.location != NSNotFound) {

        // If we find the first pronunciation, run the whole regex on a range of {position, 500} only.

        int rangeLocation = next.location;
        int searchPadding = 500;
        int rangeLength = searchPadding;

        if(data.length - next.location < searchPadding) { // Only use 500 if there is 500 more length in the data.
            rangeLength = data.length - next.location;
        } 

        [regularExpression enumerateMatchesInString:data 
                                            options:0
                                              range:NSMakeRange(rangeLocation, rangeLength)
                                         usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop){
                                             [matches addObject:[data substringWithRange:result.range]];
                                         }]; // Grab all the hits at once.

        next.length = next.location - pos;
        next.location = pos;
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove the word.
        pos += (next.length+1);
    } else { // No hits.
        [mutableArrayOfUnfoundWords addObject:[mutableArrayOfWordsToMatch objectAtIndex:0]]; // Add to unfound words.
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove from the word list.
    }
}    

[mutablePatternString release];
[mutableArrayOfUnfoundWords release];
[mutableArrayOfWordsToMatch release];

// return matches to caller

EDIT: Here is another version, which does not use the regex at all and shaves a little more time off the method:

NSMutableArray *mutableArrayOfWordsToMatch = [[NSMutableArray alloc] initWithArray:array];
NSMutableArray *mutableArrayOfUnfoundWords = [[NSMutableArray alloc] init]; // I also need to know the unfound words.

NSUInteger pos = 0;

NSMutableArray * matches = [NSMutableArray array];

while (pos != data.length) {

    if([mutableArrayOfWordsToMatch count] <= 0) { // If we're at the top of the loop without any more words, stop.
        break;
    }  

    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
                    rangeOfString:[NSString stringWithFormat:@"\n%@\t",[mutableArrayOfWordsToMatch objectAtIndex:0]]
                    options:NSLiteralSearch
                    range:remaining
                    ]; // Just search for the first pronunciation.
    if (next.location != NSNotFound) {
        NSRange lineRange = [data lineRangeForRange:NSMakeRange(next.location+1, next.length)];
        [matches addObject:[data substringWithRange:NSMakeRange(lineRange.location, lineRange.length-1)]]; // Grab the whole line of the hit.
        int rangeLocation = next.location;
        int rangeLength = 750;

        if(data.length - next.location < rangeLength) { // Only use the searchPadding if there is that much room left in the string.
            rangeLength = data.length - next.location;
        } 
        rangeLength = rangeLength/5;
        int newlocation = rangeLocation;

        for(int i = 2;i < 6; i++) { // We really only need to do this from 2-5.
            NSRange morematches = [data
                            rangeOfString:[NSString stringWithFormat:@"\n%@(%d",[mutableArrayOfWordsToMatch objectAtIndex:0],i]
                            options:NSLiteralSearch
                            range:NSMakeRange(newlocation, rangeLength)
                            ];
            if(morematches.location != NSNotFound) {
                NSRange moreMatchesLineRange = [data lineRangeForRange:NSMakeRange(morematches.location+1, morematches.length)]; // Plus one because I don't actually want the line break at the beginning.
                 [matches addObject:[data substringWithRange:NSMakeRange(moreMatchesLineRange.location, moreMatchesLineRange.length-1)]]; // Minus one because I don't actually want the line break at the end.
                newlocation = morematches.location;

            } else {
                break;   
            }
        }

        next.length = next.location - pos;
        next.location = pos;
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove the word.
        pos += (next.length+1);
    } else { // No hits.
        [mutableArrayOfUnfoundWords addObject:[mutableArrayOfWordsToMatch objectAtIndex:0]]; // Add to unfound words.
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove from the word list.
    }
}    

[mutableArrayOfUnfoundWords release];
[mutableArrayOfWordsToMatch release];

Answer 4

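Looking at the dictionary file you provided, I would say a reasonable strategy is to read in the data and put it into some kind of persistent data store.

Read through the file and create an object for each unique word, holding n pronunciation strings (where n is the number of unique pronunciations). The dictionary is already in alphabetical order, so if you parse it in the order you read it, you end up with a list that is in alphabetical order.

You can then do a binary search on the data. Even with a huge number of objects, a binary search finds what you are looking for very quickly (assuming alphabetical order).

If you need lightning-fast performance, you could even keep the whole thing in memory.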
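
As a rough illustration of the binary-search idea, the sketch below searches an in-memory array of dictionary lines rather than per-word objects in a persistent store, which is a simplification of what is described above. The names HeadWord and LinesForWord are hypothetical. It assumes the lines are ordered by head word in the same order that -compare: produces, and that head words contain no '(' of their own; both would need checking against the real file.

```objective-c
#import <Foundation/Foundation.h>

// Head word of a dictionary line: everything before the first tab or '('.
// A string with neither (such as a bare search word) is returned unchanged.
static NSString *HeadWord(NSString *line) {
    NSCharacterSet *terminators =
        [NSCharacterSet characterSetWithCharactersInString:@"\t("];
    NSRange r = [line rangeOfCharacterFromSet:terminators];
    return (r.location == NSNotFound) ? line : [line substringToIndex:r.location];
}

// Binary-searches `sortedLines` (one dictionary line per element, ordered by
// head word) and returns every line whose head word equals `word`.
static NSArray *LinesForWord(NSString *word, NSArray *sortedLines) {
    NSComparator byHeadWord = ^NSComparisonResult(id a, id b) {
        return [HeadWord(a) compare:HeadWord(b)];
    };
    NSUInteger index =
        [sortedLines indexOfObject:word
                     inSortedRange:NSMakeRange(0, [sortedLines count])
                           options:NSBinarySearchingFirstEqual
                   usingComparator:byHeadWord];

    NSMutableArray *result = [NSMutableArray array];
    if (index == NSNotFound) {
        return result;
    }
    // Alternates such as word(2) directly follow the first entry, so walk
    // forward until the head word changes.
    while (index < [sortedLines count] &&
           [HeadWord([sortedLines objectAtIndex:index]) isEqualToString:word]) {
        [result addObject:[sortedLines objectAtIndex:index]];
        index++;
    }
    return result;
}
```

The array of lines could come from something like componentsSeparatedByString:@"\n" on the file contents, done once up front. Each lookup then costs a logarithmic number of comparisons plus the short walk over any alternates.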

4 answers

Answer 1  4  2012-08-14 17:22:22

Answer 2  4  2012-08-14 18:51:09

Answer 3  2  2012-08-15 13:52:59

Answer 4  1  2012-08-14 17:26:18
