Optimal Compare Algorithm for finding string matches in List of strings C#
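Say I have a list of 100,000 words. I want to find out whether a given string matches any word in that list, and I want to do it in the fastest way possible. I also want to know whether any other words, formed by starting with the first character, appear in that string.
For example:
Say you have the string "icedtgg"
"i" "ic" "ice" "iced" "icedt" "icedtg" "icedtgg"
I am trying to come up with an optimal compare algorithm that tells me whether any of the words above are in my list.
So far, my list of 100,000 words is stored in a
Dictionary<char, List<string>> WordList;
where the char is the first character of the word and the List<string> is all of the words that start with that character.
So WordList['a'] holds the list of all words that start with 'a' ("ape", "apple", "art", etc.), WordList['b'] holds the list of all words that start with 'b', and so on.
Since I know that all of my candidate words start with 'i', I can first narrow the search from 100,000 words down to just the words that start with 'i'.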
List<string> CurrentWordList = WordList['i'];
Now I check
if( CurrentWordList[0].Length == 1 )
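Then I know that my first string, "i", is a match, because "i" would be the first word in the list. These lists are sorted alphabetically beforehand so that sorting does not slow down the matching.
Any ideas?
*No, this is not a homework assignment. I am a professional software architect trying to find an optimal matching algorithm for fun/hobby/game development.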
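For comparison with the answers that follow, here is a minimal baseline sketch of the task as stated: test every prefix of the input against a HashSet<string>. It is not from the original post; the class and method names (PrefixBaseline, FindPrefixWords) are made up for illustration, and it assumes case-sensitive matching.

```csharp
using System;
using System.Collections.Generic;

public static class PrefixBaseline
{
    // Returns every prefix of 'input' that is present in 'words', shortest first.
    public static List<string> FindPrefixWords(HashSet<string> words, string input)
    {
        var matches = new List<string>();
        for (int len = 1; len <= input.Length; len++)
        {
            // Allocates and hashes one string per prefix, so the total work
            // grows quadratically with the length of the input.
            string prefix = input.Substring(0, len);
            if (words.Contains(prefix))
                matches.Add(prefix);
        }
        return matches;
    }
}
```

This is simple and independent of the dictionary size per lookup, but it allocates a new string for each prefix, which is the overhead that the trie and merge approaches in the answers avoid.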
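I decided to add this answer not because it is the optimal solution to your problem, but to illustrate two possible solutions that are relatively simple and somewhat in line with the approach you seem to be following yourself.
The (non-optimized) sample below provides an extremely simple prefix trie implementation that uses one node per consumed character.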
public class SimplePrefixTrie
{
private readonly Node _root = new Node(); // root represents empty string.
private class Node
{
public Dictionary<char, Node> Children;
public bool IsTerminal; // whether a full word ends here.
public Node Find(string word, int index)
{
var child = default(Node);
if (index < word.Length && Children != null)
Children.TryGetValue(word[index], out child);
return child;
}
public Node Add(string word, int toConsume)
{
var child = default(Node);
if (toConsume == word.Length)
this.IsTerminal = true;
else if (Children == null || !Children.TryGetValue(word[toConsume], out child))
{
if (Children == null)
Children = new Dictionary<char, Node>();
Children[word[toConsume]] = child = new Node();
}
return child;
}
}
public void AddWord(string word)
{
var ndx = 0;
var cur = _root;
while (cur != null)
cur = cur.Add(word, ndx++);
}
public IEnumerable<string> FindWordsMatchingPrefixesOf(string searchWord)
{
var ndx = 0;
var cur = _root;
while (cur != null)
{
if (cur.IsTerminal)
yield return searchWord.Substring(0, ndx);
cur = cur.Find(searchWord, ndx++);
}
}
}
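A simple implementation of a compressed prefix trie is also added below. It follows nearly the same approach as the sample above, but stores shared prefix portions instead of single characters. It splits a node when an existing stored prefix becomes shared and needs to be divided into two parts.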
public class SimpleCompressedPrefixTrie
{
private readonly Node _root = new Node();
private class Node
{
private Dictionary<char, Node> _children;
public string PrefixValue = string.Empty;
public bool IsTerminal;
public Node Add(string word, ref int startIndex)
{
var n = FindSharedPrefix(word, startIndex);
startIndex += n;
if (n == PrefixValue.Length) // full prefix match
{
if (startIndex == word.Length) // full match
IsTerminal = true;
else
return AddToChild(word, ref startIndex);
}
else // partial match, need to split this node's prefix.
SplittingAdd(word, n, ref startIndex);
return null;
}
public Node Find(string word, ref int startIndex, out int matchLen)
{
var n = FindSharedPrefix(word, startIndex);
startIndex += n;
matchLen = -1;
if (n == PrefixValue.Length)
{
if (IsTerminal)
matchLen = startIndex;
var child = default(Node);
if (_children != null && startIndex < word.Length && _children.TryGetValue(word[startIndex], out child))
{
startIndex++; // consumed map key character.
return child;
}
}
return null;
}
private Node AddToChild(string word, ref int startIndex)
{
var key = word[startIndex++]; // consume the mapping character
var nextNode = default(Node);
if (_children == null)
_children = new Dictionary<char, Node>();
else if (_children.TryGetValue(key, out nextNode))
return nextNode;
var remainder = word.Substring(startIndex);
_children[key] = new Node() { PrefixValue = remainder, IsTerminal = true };
return null; // consumed.
}
private void SplittingAdd(string word, int n, ref int startIndex)
{
var curChildren = _children;
_children = new Dictionary<char, Node>();
_children[PrefixValue[n]] = new Node()
{
PrefixValue = this.PrefixValue.Substring(n + 1),
IsTerminal = this.IsTerminal,
_children = curChildren
};
PrefixValue = PrefixValue.Substring(0, n);
IsTerminal = startIndex == word.Length;
if (!IsTerminal)
{
var prefix = word.Length > startIndex + 1 ? word.Substring(startIndex + 1) : string.Empty;
_children[word[startIndex]] = new Node() { PrefixValue = prefix, IsTerminal = true };
startIndex++;
}
}
private int FindSharedPrefix(string word, int startIndex)
{
var n = Math.Min(PrefixValue.Length, word.Length - startIndex);
var len = 0;
while (len < n && PrefixValue[len] == word[len + startIndex])
len++;
return len;
}
}
public void AddWord(string word)
{
var ndx = 0;
var cur = _root;
while (cur != null)
cur = cur.Add(word, ref ndx);
}
public IEnumerable<string> FindWordsMatchingPrefixesOf(string searchWord)
{
var startNdx = 0;
var cur = _root;
while (cur != null)
{
var matchLen = 0;
cur = cur.Find(searchWord, ref startNdx, out matchLen);
if (matchLen > 0)
yield return searchWord.Substring(0, matchLen);
};
}
}
Example usage:
var trie = new SimplePrefixTrie(); // or new SimpleCompressedPrefixTrie();
trie.AddWord("hello");
trie.AddWord("iced");
trie.AddWord("i");
trie.AddWord("ice");
trie.AddWord("icecone");
trie.AddWord("dtgg");
trie.AddWord("hicet");
foreach (var w in trie.FindWordsMatchingPrefixesOf("icedtgg"))
Console.WriteLine(w);
Output:
i
ice
iced
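Update: Selecting the right data structure matters
I think an update can provide some value by illustrating how important it is to select a data structure that fits the problem and which trade-offs are involved. I therefore created a small benchmark application that tests the strategies from the answers provided to this question so far against a baseline reference implementation.
The complete benchmark code can be found in this gist. The results of running it with dictionaries of 10,000, 100,000, and 1,000,000 words (randomly generated character sequences) and searching for all prefix matches of 5,000 terms are: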
Matching 5000 words to a dictionary of 10000 terms of max length 25
Method          Memory (MB)         Build Time (s)      Lookup Time (s)
Naive           0.64-0.64, 0.64     0.001-0.002, 0.001  6.136-6.312, 6.210
JimMischel      0.84-0.84, 0.84     0.013-0.018, 0.016  0.083-0.113, 0.102
JimMattyDSL     0.80-0.81, 0.80     0.013-0.018, 0.016  0.008-0.011, 0.010
SimpleTrie      24.55-24.56, 24.56  0.042-0.056, 0.051  0.002-0.002, 0.002
CompressedTrie  1.84-1.84, 1.84     0.003-0.003, 0.003  0.002-0.002, 0.002
MattyMerrix     0.83-0.83, 0.83     0.017-0.017, 0.017  0.034-0.034, 0.034
Matching 5000 words to a dictionary of 100000 terms of max length 25
Method          Memory (MB)            Build Time (s)      Lookup Time (s)
Naive           6.01-6.01, 6.01        0.024-0.026, 0.025  65.651-65.758, 65.715
JimMischel      6.32-6.32, 6.32        0.232-0.236, 0.233  1.208-1.254, 1.235
JimMattyDSL     5.95-5.96, 5.96        0.264-0.269, 0.266  0.050-0.052, 0.051
SimpleTrie      226.49-226.49, 226.49  0.932-0.962, 0.951  0.004-0.004, 0.004
CompressedTrie  16.10-16.10, 16.10     0.101-0.126, 0.111  0.003-0.003, 0.003
MattyMerrix     6.15-6.15, 6.15        0.254-0.269, 0.259  0.414-0.418, 0.416
Matching 5000 words to a dictionary of 1000000 terms of max length 25
Method          Memory (MB)               Build Time (s)         Lookup Time (s)
JimMischel      57.69-57.69, 57.69        3.027-3.086, 3.052     16.341-16.415, 16.373
JimMattyDSL     60.88-60.88, 60.88        3.396-3.484, 3.453     0.399-0.400, 0.399
SimpleTrie      2124.57-2124.57, 2124.57  11.622-11.989, 11.860  0.006-0.006, 0.006
CompressedTrie  166.59-166.59, 166.59     2.813-2.832, 2.823     0.005-0.005, 0.005
MattyMerrix     62.71-62.73, 62.72        3.230-3.270, 3.251     6.996-7.015, 7.008
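As you can see, the memory required by the (non-space-optimized) tries is significantly higher. For all tested implementations, memory grows with the size of the dictionary, O(N).
As expected, lookup time for the tries is more or less constant: O(k), dependent only on the length of the search term. For the other implementations, lookup time increases with the size of the dictionary being searched.
Note that far more optimal implementations can be constructed for this problem, ones that come close to O(k) for search time while allowing more compact storage and a reduced memory footprint. If you map to a reduced alphabet (e.g. only 'A'-'Z'), that is also something that can be taken advantage of.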
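As one way to picture the reduced-alphabet idea, the sketch below replaces the per-node Dictionary<char, Node> with a fixed array of 26 child slots. This is an illustrative assumption, not code from the benchmark: the class name AlphabetTrie and the choice to fold input to upper case are mine, and the array layout mainly removes hashing from the lookup path; whether it also saves memory depends on how densely the nodes are populated.

```csharp
using System;
using System.Collections.Generic;

public sealed class AlphabetTrie
{
    private sealed class Node
    {
        public readonly Node[] Children = new Node[26]; // one slot per letter 'A'..'Z'
        public bool IsTerminal;                          // whether a full word ends here
    }

    private readonly Node _root = new Node();

    // Maps a character to 0..25, or -1 if it is outside the reduced alphabet.
    private static int IndexOf(char c)
    {
        c = char.ToUpperInvariant(c);
        return c >= 'A' && c <= 'Z' ? c - 'A' : -1;
    }

    public void AddWord(string word)
    {
        var cur = _root;
        foreach (char c in word)
        {
            int i = IndexOf(c);
            if (i < 0)
                throw new ArgumentException("Only letters A-Z are supported.", nameof(word));
            cur = cur.Children[i] ?? (cur.Children[i] = new Node());
        }
        cur.IsTerminal = true;
    }

    // Yields every prefix of searchWord that was added as a word, shortest first.
    public IEnumerable<string> FindWordsMatchingPrefixesOf(string searchWord)
    {
        var cur = _root;
        for (int n = 0; n < searchWord.Length; n++)
        {
            int i = IndexOf(searchWord[n]);
            if (i < 0) yield break;
            cur = cur.Children[i];
            if (cur == null) yield break;
            if (cur.IsTerminal)
                yield return searchWord.Substring(0, n + 1);
        }
    }
}
```

It exposes the same AddWord and FindWordsMatchingPrefixesOf methods as the two tries above, so it can be dropped into the same usage example.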
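So you just want to find the words in the dictionary that are prefixes of the input string? You can do this much more efficiently than any of the methods proposed so far. It is really just a modified merge.
If your word list consists of a dictionary keyed by first letter, with each entry containing a sorted list of the words that begin with that letter, then the following will do it. The worst case is O(n + m), where n is the number of words that start with the letter and m is the length of the input string.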
var inputString = "icegdt";
// get list of words that start with the first character
var wordsList = MyDictionary[inputString[0]];
// find all words that are prefixes of the input string
var iInput = 0;
var iWords = 0;
var prefix = inputString.Substring(0, iInput+1);
while (iInput < inputString.Length && iWords < wordsList.Count)
{
if (wordsList[iWords] == prefix)
{
// wordsList[iWords] is found!
++iWords;
}
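// C# strings do not support '>'; CompareOrdinal assumes the list was sorted ordinally.
else if (string.CompareOrdinal(wordsList[iWords], prefix) > 0)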
{
// The current word is alphabetically after the prefix.
// So we need the next character.
++iInput;
if (iInput < inputString.Length)
{
prefix = inputString.Substring(0, iInput+1);
}
}
else
{
// The prefix is alphabetically after the current word.
// Advance the current word.
++iWords;
}
}
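If this is all you want to do (find the dictionary words that are prefixes of the input string), then there is no particular reason for your dictionary to be indexed by first character. Given a sorted list of words, you could do a binary search on the first letter to find the starting point. That would take slightly more time than the dictionary lookup, but the difference would be very small compared to the time spent searching the word list for matches. In addition, the sorted word list would take less memory than the dictionary approach.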
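A minimal sketch of that binary search, assuming the list is sorted with StringComparer.Ordinal (the helper name SortedWordList.FindStart is hypothetical, not part of the original answer). It returns the index of the first word whose first character is not less than the requested letter, which is where the merge loop would start.

```csharp
using System;
using System.Collections.Generic;

public static class SortedWordList
{
    // Returns the index of the first word whose first character is >= 'first'
    // in ordinal order, or words.Count if there is no such word.
    // Assumes 'words' is sorted with StringComparer.Ordinal.
    public static int FindStart(List<string> words, char first)
    {
        int lo = 0, hi = words.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            // Empty strings sort before everything else, so skip past them.
            if (words[mid].Length == 0 || words[mid][0] < first)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }
}
```

The caller still has to check that the word at the returned index actually begins with the letter, since the list may contain no words for it.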
If you want to do case-insensitive comparisons, change the comparison code to:
var result = String.Compare(wordsList[iWords], prefix, true);
if (result == 0)
{
// wordsList[iWords] is found!
++iWords;
}
else if (result > 0)
{
This also reduces the number of string comparisons to exactly one per iteration.
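Putting the fragment back into the full loop gives something like the sketch below. The wrapper PrefixMerge.FindPrefixWords is a hypothetical name, and it uses StringComparison.OrdinalIgnoreCase rather than the culture-sensitive String.Compare(a, b, true) shown above; that is an assumption on my part, and it requires the word list to be sorted with StringComparer.OrdinalIgnoreCase so that the sort order and the comparison agree.

```csharp
using System;
using System.Collections.Generic;

public static class PrefixMerge
{
    // Returns the words in sortedWords that are prefixes of input, ignoring case.
    // sortedWords must be sorted with StringComparer.OrdinalIgnoreCase.
    public static List<string> FindPrefixWords(IReadOnlyList<string> sortedWords, string input)
    {
        var found = new List<string>();
        if (input.Length == 0) return found;

        int iInput = 0, iWords = 0;
        string prefix = input.Substring(0, 1);
        while (iInput < input.Length && iWords < sortedWords.Count)
        {
            // Exactly one string comparison per iteration.
            int result = string.Compare(sortedWords[iWords], prefix, StringComparison.OrdinalIgnoreCase);
            if (result == 0)
            {
                found.Add(sortedWords[iWords]); // the word equals the current prefix
                ++iWords;
            }
            else if (result > 0)
            {
                // The word sorts after the prefix, so extend the prefix by one character.
                ++iInput;
                if (iInput < input.Length)
                    prefix = input.Substring(0, iInput + 1);
            }
            else
            {
                ++iWords; // the word sorts before the prefix, so move to the next word
            }
        }
        return found;
    }
}
```

Each iteration advances either the word index or the prefix length, which is where the O(n + m) bound comes from.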
while (x < str.Length-1)
{
if (ChrW(10) == GetChar(str, x) && ChrW(13) == GetChar(str, x+1))
{
// x+2 - This new line
}
x++;
}
Here is my first pass at it. I wanted to get it posted in case I cannot finish it today.
public class CompareHelper
{
//Should always be sorted in alphabetical order.
public static Dictionary<char, List<string>> MyDictionary;
public static List<string> CurrentWordList;
public static List<string> MatchedWordList;
//The word we are trying to find matches for.
public static char InitChar;
public static StringBuilder ThisWord;
/// <summary>
/// Initialize the Compare. Set the first character. See if there are any 1 letter words
/// for that character.
/// </summary>
/// <param name="firstChar">The first character in the word string.</param>
/// <returns>True if a word was found.</returns>
public static bool InitCompare(char firstChar)
{
InitChar = firstChar;
//Get all words that start with the firstChar.
CurrentWordList = MyDictionary[InitChar];
ThisWord = new StringBuilder();
ThisWord.Append(firstChar);
if (CurrentWordList[0].Length == 1)
{
//Match.
return true;
}
//No matches.
return false;
}
/// <summary>
/// Append this letter to our ThisWord. See if there are any matching words.
/// </summary>
/// <param name="nextChar">The next character in the word string.</param>
/// <returns>True if a word was found.</returns>
public static bool NextCompare(char nextChar)
{
ThisWord.Append(nextChar);
int currentIndex = ThisWord.Length - 1;
if (FindRemainingWords(nextChar, currentIndex))
{
if (CurrentWordList[0].Length == ThisWord.Length)
{
//Match.
return true;
}
}
//No matches.
return false;
}
/// <summary>
/// Trim down our CurrentWordList until it only contains words
/// that at currIndex start with the currChar.
/// </summary>
/// <param name="currChar">The next letter in our ThisWord.</param>
/// <param name="currIndex">The index of the letter.</param>
/// <returns>True if there are words remaining in CurrentWordList.</returns>
private static bool FindRemainingWords(char currChar, int currIndex)
{
//Null check.
if (CurrentWordList == null || CurrentWordList.Count < 1)
{
return false;
}
bool doneSearching = false;
while (!doneSearching && CurrentWordList.Count > 0)
{
int middleIndex = CurrentWordList.Count / 2;
string middleWord = CurrentWordList[middleIndex];
//A word too short to have a letter at currIndex sorts before every longer
//word with the same prefix, so treat it like Before.
LetterPositionEnum returnEnum = middleWord.Length <= currIndex
? LetterPositionEnum.Before
: GetLetterPosition(currChar, middleWord[currIndex]);
switch (returnEnum)
{
case LetterPositionEnum.Before:
case LetterPositionEnum.PREV:
//The middle letter sorts before currChar: keep only the words after the middle.
CurrentWordList = CurrentWordList.GetRange(middleIndex + 1, CurrentWordList.Count - middleIndex - 1);
break;
case LetterPositionEnum.NEXT:
case LetterPositionEnum.After:
//The middle letter sorts after currChar: keep only the words before the middle.
CurrentWordList = CurrentWordList.GetRange(0, middleIndex);
break;
default:
//MATCH: stop halving and let TrimWords find the exact bounds.
doneSearching = true;
break;
}
}
TrimWords(currChar, currIndex);
//Null check.
if (CurrentWordList == null || CurrentWordList.Count < 1)
{
return false;
}
//There are still words left in CurrentWordList.
return true;
}
//Trim all words in CurrentWordList
//that are LetterPositionEnum.PREV and LetterPositionEnum.NEXT
private static void TrimWords(char currChar, int currIndex)
{
int startIndex = 0;
int endIndex = CurrentWordList.Count;
bool startIndexFound = false;
//Loop through all of the words.
for (int i = 0; i < CurrentWordList.Count; i++)
{
string word = CurrentWordList[i];
//Words shorter than currIndex + 1 have no letter here and cannot match.
bool matches = word.Length > currIndex && word[currIndex] == currChar;
//The first match of currChar is the start index.
if (!startIndexFound && matches)
{
startIndex = i;
startIndexFound = true;
}
//After the start index, the first word that does not match is the end index.
else if (startIndexFound && !matches)
{
endIndex = i;
break;
}
}
if (!startIndexFound)
{
CurrentWordList = new List<string>();
return;
}
//Trim the words that do not have currChar at currIndex.
//GetRange takes (index, count), not (start, end).
CurrentWordList = CurrentWordList.GetRange(startIndex, endIndex - startIndex);
}
//In order to find all words that begin with a given character, we should search
//for the last word that begins with the previous character (PREV) and the
//first word that begins with the next character (NEXT).
//Anything else Before or After that is trash and we will throw out.
public enum LetterPositionEnum
{
Before,
PREV,
MATCH,
NEXT,
After
};
//We want to ignore all letters that come before this one.
public static LetterPositionEnum GetLetterPosition(char currChar, char compareLetter)
{
switch (currChar)
{
case 'A':
switch (compareLetter)
{
case 'A': return LetterPositionEnum.MATCH;
case 'B': return LetterPositionEnum.NEXT;
case 'C': return LetterPositionEnum.After;
case 'D': return LetterPositionEnum.After;
case 'E': return LetterPositionEnum.After;
case 'F': return LetterPositionEnum.After;
case 'G': return LetterPositionEnum.After;
case 'H': return LetterPositionEnum.After;
case 'I': return LetterPositionEnum.After;
case 'J': return LetterPositionEnum.After;
case 'K': return LetterPositionEnum.After;
case 'L': return LetterPositionEnum.After;
case 'M': return LetterPositionEnum.After;
case 'N': return LetterPositionEnum.After;
case 'O': return LetterPositionEnum.After;
case 'P': return LetterPositionEnum.After;
case 'Q': return LetterPositionEnum.After;
case 'R': return LetterPositionEnum.After;
case 'S': return LetterPositionEnum.After;
case 'T': return LetterPositionEnum.After;
case 'U': return LetterPositionEnum.After;
case 'V': return LetterPositionEnum.After;
case 'W': return LetterPositionEnum.After;
case 'X': return LetterPositionEnum.After;
case 'Y': return LetterPositionEnum.After;
case 'Z': return LetterPositionEnum.After;
default: return LetterPositionEnum.After;
}
case 'B':
switch (compareLetter)
{
case 'A': return LetterPositionEnum.PREV;
case 'B': return LetterPositionEnum.MATCH;
case 'C': return LetterPositionEnum.NEXT;
case 'D': return LetterPositionEnum.After;
case 'E': return LetterPositionEnum.After;
case 'F': return LetterPositionEnum.After;
case 'G': return LetterPositionEnum.After;
case 'H': return LetterPositionEnum.After;
case 'I': return LetterPositionEnum.After;
case 'J': return LetterPositionEnum.After;
case 'K': return LetterPositionEnum.After;
case 'L': return LetterPositionEnum.After;
case 'M': return LetterPositionEnum.After;
case 'N': return LetterPositionEnum.After;
case 'O': return LetterPositionEnum.After;
case 'P': return LetterPositionEnum.After;
case 'Q': return LetterPositionEnum.After;
case 'R': return LetterPositionEnum.After;
case 'S': return LetterPositionEnum.After;
case 'T': return LetterPositionEnum.After;
case 'U': return LetterPositionEnum.After;
case 'V': return LetterPositionEnum.After;
case 'W': return LetterPositionEnum.After;
case 'X': return LetterPositionEnum.After;
case 'Y': return LetterPositionEnum.After;
case 'Z': return LetterPositionEnum.After;
default: return LetterPositionEnum.After;
}
case 'C':
switch (compareLetter)
{
case 'A': return LetterPositionEnum.Before;
case 'B': return LetterPositionEnum.PREV;
case 'C': return LetterPositionEnum.MATCH;
case 'D': return LetterPositionEnum.NEXT;
case 'E': return LetterPositionEnum.After;
case 'F': return LetterPositionEnum.After;
case 'G': return LetterPositionEnum.After;
case 'H': return LetterPositionEnum.After;
case 'I': return LetterPositionEnum.After;
case 'J': return LetterPositionEnum.After;
case 'K': return LetterPositionEnum.After;
case 'L': return LetterPositionEnum.After;
case 'M': return LetterPositionEnum.After;
case 'N': return LetterPositionEnum.After;
case 'O': return LetterPositionEnum.After;
case 'P': return LetterPositionEnum.After;
case 'Q': return LetterPositionEnum.After;
case 'R': return LetterPositionEnum.After;
case 'S': return LetterPositionEnum.After;
case 'T': return LetterPositionEnum.After;
case 'U': return LetterPositionEnum.After;
case 'V': return LetterPositionEnum.After;
case 'W': return LetterPositionEnum.After;
case 'X': return LetterPositionEnum.After;
case 'Y': return LetterPositionEnum.After;
case 'Z': return LetterPositionEnum.After;
default: return LetterPositionEnum.After;
}
//etc. Stack Overflow limits characters to 30,000 contact me for full switch case.
default: return LetterPositionEnum.After;
}
}
}
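OK, here is the final solution I came up with. I am not sure whether it is truly optimal, but it seems to be pretty fast, and I like the logic and the brevity of the code.
Basically, on app startup you pass a list of words of any length to InitWords. This sorts the words and places them into a dictionary with 26 keys, one for each letter of the alphabet.
Then during play you iterate through the character set, always starting with the first letter, then the first and second letters, and so on. All the while you are trimming down the number of words in CurrentWordList.
So if you have the string "icedgt", you call InitCompare with 'i', which grabs the KeyValuePair with key 'I' from MyDictionary. You then check whether the first word has length 1; since the words are already in alphabetical order, the word "I" would be the first word. On the next iteration you pass 'c' to NextCompare, which reduces the list again by using Linq to return only the words whose second character is 'c'. Next you call NextCompare again and pass in 'e', once more using Linq to reduce the number of words in CurrentWordList.
So after the first iteration CurrentWordList holds every word that starts with 'i'; after the next NextCompare it holds every word that starts with 'ic', and after the one following that it holds the subset in which every word starts with 'ice', and so on.
I am not sure whether Linq beats my manual giant switch case in terms of speed, but it is simple and elegant, and for that I am happy.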
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace Xuzzle.Code
{
public class CompareHelper
{
//Should always be sorted in alphabetical order.
public static Dictionary<char, List<string>> MyDictionary;
public static List<string> CurrentWordList;
//The word we are trying to find matches for.
public static char InitChar;
public static StringBuilder ThisWord;
/// <summary>
/// Init MyDictionary with the list of words passed in. Make a new
/// key value pair with each Letter.
/// </summary>
/// <param name="listOfWords"></param>
public static void InitWords(List<string> listOfWords)
{
MyDictionary = new Dictionary<char, List<string>>();
foreach (char currChar in LetterHelper.Alphabet)
{
var wordsParsed = listOfWords.Where(currWord => char.ToUpper(currWord[0]) == currChar).ToArray();
Array.Sort(wordsParsed);
MyDictionary.Add(currChar, wordsParsed.ToList());
}
}
/// <summary>
/// Initialize the Compare. Set the first character. See if there are any 1 letter words
/// for that character.
/// </summary>
/// <param name="firstChar">The first character in the word string.</param>
/// <returns>True if a word was found.</returns>
public static bool InitCompare(char firstChar)
{
InitChar = firstChar;
//Get all words that start with the firstChar.
//InitWords keys MyDictionary by upper-case letter.
CurrentWordList = MyDictionary[char.ToUpper(InitChar)];
ThisWord = new StringBuilder();
ThisWord.Append(firstChar);
if (CurrentWordList.Count > 0 && CurrentWordList[0].Length == 1)
{
//Match.
return true;
}
//No matches.
return false;
}
/// <summary>
/// Append this letter to our ThisWord. See if there are any matching words.
/// </summary>
/// <param name="nextChar">The next character in the word string.</param>
/// <returns>True if a word was found.</returns>
public static bool NextCompare(char nextChar)
{
ThisWord.Append(nextChar);
int currentIndex = ThisWord.Length - 1;
if (CurrentWordList != null && CurrentWordList.Count > 0)
{
CurrentWordList = CurrentWordList.Where(word => (word.Length > currentIndex && word[currentIndex] == nextChar)).ToList();
if (CurrentWordList != null && CurrentWordList.Count > 0)
{
if (CurrentWordList[0].Length == ThisWord.Length)
{
//Match.
return true;
}
}
}
//No matches.
return false;
}
}
}