
Get words from a given word collection for proofreading

I have a collection of words stored in a List object, for example this collection of titles:

Lorem Ipsum
Centuries
Electronic

I also have a sample paragraph in which I want to find these words.

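My goal is to extract those words from the paragraph. An exact match is not required, because the aim is to correct capitalization and misspelled words.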

My expected result is:

lorem ipsum
Loren Ipsum
centuries
electornic
LorenIpsum
LoremIpsum

The result is not limited to these, because this will run over an entire article, and over hundreds or thousands of articles.

Sorry, I don't have any code written yet, but I plan to use RegEx for C# here.
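
As a rough illustration of the RegEx idea (this is not code from the original post), a case-insensitive pattern with an optional space between the words could look like the sketch below. The class `RegexSketch` and the helpers `ToPattern` and `Find` are hypothetical names introduced only for this example. Under these assumptions the pattern finds case variants and joined words, but not misspellings such as "Loren Ipsum", which is why a similarity measure is likely needed as well.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class RegexSketch
{
    // Hypothetical helper: turns "Lorem Ipsum" into the pattern "Lorem\s*Ipsum",
    // so the whitespace between the words becomes optional.
    public static string ToPattern(string title)
    {
        return string.Join(@"\s*", title.Split(' ').Select(w => Regex.Escape(w)));
    }

    // Hypothetical helper: returns every case-insensitive match of the title in the text.
    public static string[] Find(string text, string title)
    {
        return Regex.Matches(text, ToPattern(title), RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    static void Main()
    {
        string sample = "lorem ipsum, Loren Ipsum, LoremIpsum";
        // "Loren Ipsum" is skipped because the pattern requires the letter 'm'.
        Console.WriteLine(string.Join(" | ", Find(sample, "Lorem Ipsum")));
        // prints: lorem ipsum | LoremIpsum
    }
}
```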

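There are many algorithms on the internet for checking the similarity between two words. The GetEdits function below is one of them: it computes the Levenshtein edit distance, that is, the number of single-character insertions, deletions, and substitutions needed to turn one string into the other.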

You can use the following code, although it may not be very efficient.

static int GetEdits(string answer, string guess)
{
    guess = guess.ToLower();
    answer = answer.ToLower();

    int[,] d = new int[answer.Length + 1, guess.Length + 1];
    for (int i = 0; i <= answer.Length; i++)
        d[i, 0] = i;
    for (int j = 0; j <= guess.Length; j++)
        d[0, j] = j;
    for (int j = 1; j <= guess.Length; j++)
        for (int i = 1; i <= answer.Length; i++)
            if (answer[i - 1] == guess[j - 1])
                d[i, j] = d[i - 1, j - 1];  //no operation
            else
                d[i, j] = Math.Min(Math.Min(
                    d[i - 1, j] + 1,    //a deletion
                    d[i, j - 1] + 1),   //an insertion
                    d[i - 1, j - 1] + 1 //a substitution
                );
    return d[answer.Length, guess.Length];
}

static void Main(string[] args)
{
    const string text = @"lorem ipsum is simply dummy text of the printing and typesetting industry. Loren Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing LorenIpsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of LoremIpsum.";

    var findWords = new string[]
    {
        "Lorem Ipsum",
        "Centuries",
        "Electronic"
    };

    const int MaxErrors = 2;

    // Tokenize text
    var tokens = text.Split(' ', ',', '.');

    for (int i = 0; i < tokens.Length; i++)
    {
        if (tokens[i] != String.Empty)
        {
            foreach (var findWord in findWords)
            {
                if (GetEdits(findWord, tokens[i]) <= MaxErrors)
                {
                    Console.WriteLine(tokens[i]);
                    break;
                }
                // For multi-word entries, join with the next token and check again.
                // The string overload of Contains compiles on every .NET version.
                else if (findWord.Contains(" ") && i + 1 < tokens.Length)
                {
                    string token = tokens[i] + " " + tokens[i + 1];
                    if (GetEdits(findWord, token) <= MaxErrors)
                    {
                        Console.WriteLine(token);
                        i++;
                        break;
                    }
                }
            }
        }
    }
}
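
Since the answer notes that the code may not be very efficient, here is a minimal sketch of one possible speed-up. It assumes only a yes/no decision against a small error limit is needed. It keeps two rows instead of the full matrix and stops early once the limit can no longer be met. The names `EditDistanceSketch` and `BoundedEdits` are hypothetical, and `ToLowerInvariant` is swapped in for `ToLower` to avoid culture-dependent casing. This is an illustration, not a replacement prescribed by the original answer.

```csharp
using System;

static class EditDistanceSketch
{
    // Returns the exact Levenshtein distance (case-insensitive) when it is
    // <= maxErrors; otherwise returns some value greater than maxErrors.
    public static int BoundedEdits(string a, string b, int maxErrors)
    {
        a = a.ToLowerInvariant();
        b = b.ToLowerInvariant();

        // The distance is at least the difference in length.
        if (Math.Abs(a.Length - b.Length) > maxErrors)
            return maxErrors + 1;

        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++)
            prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(prev[j] + 1, curr[j - 1] + 1), // deletion, insertion
                    prev[j - 1] + cost);                    // match or substitution
                rowMin = Math.Min(rowMin, curr[j]);
            }

            // The final distance can never be smaller than the minimum of any
            // row, so once a whole row exceeds the limit we can stop.
            if (rowMin > maxErrors)
                return maxErrors + 1;

            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    static void Main()
    {
        Console.WriteLine(BoundedEdits("Lorem Ipsum", "LorenIpsum", 2)); // prints 2
        Console.WriteLine(BoundedEdits("Centuries", "typesetting", 2) > 2); // prints True
    }
}
```

In the loop above, a call such as `GetEdits(findWord, tokens[i]) <= MaxErrors` could then be written as `BoundedEdits(findWord, tokens[i], MaxErrors) <= MaxErrors`.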

