
Selecting text from Dictionary - single words vs phrases - e.g. 'rice' vs 'rice wine'

Problem

I am writing a recipe parser in C#. I select text inside a RichTextBox wherever a recipe ingredient matches an entry in a Dictionary. I'm not sure how to deal with (or describe) the case where a single word is matched, and therefore double counted, inside a phrase that is also in the Dictionary.

Example

In my Dictionary I have entries for 'rice' and 'rice wine'. I want to make sure that 'rice' is not matched inside phrases that are already in the Dictionary, like 'rice wine'. That is, the 'rice' part of 'rice wine' should not be matched by the single-word 'rice' entry.

Terminology

I'd imagine this is a fairly common case in text retrieval, but I don't know what the domain terminology would be.

Code

Currently I'm loading the Dictionary from an SQL query:

tagList.Add(new KeyValuePair<string, string>(reader[0].ToString(), "0"));

I then search the RichTextBox by looping over the Dictionary and, for each entry, looping through the RTB:

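foreach (KeyValuePair<string, string> word in tagList)
{
    int startindex = 0;
    while (startindex < richTextBox1.TextLength)
    {
        // Find the next whole-word occurrence of this entry at or after startindex.
        int wordstartIndex = richTextBox1.Find(word.Key, startindex, RichTextBoxFinds.WholeWord);
        if (wordstartIndex == -1)
            break; // no further occurrences of this entry

        Console.WriteLine("found: " + word.Key);

        // Select and highlight the match.
        richTextBox1.SelectionStart = wordstartIndex;
        richTextBox1.SelectionLength = word.Key.Length;
        if (word.Value == "0")
        {
            richTextBox1.SelectionBackColor = Color.Yellow;
        }

        // Continue just past this match. Using "+=" here would add the match
        // position on top of the old start index and skip later matches.
        startindex = wordstartIndex + word.Key.Length;
    }
}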

Use a SortedList instead of a Dictionary, so that "rice" sorts right before "rice wine" and any other multi-word entries that start with it. When you find a match for "rice", enter a second loop where you peek at the next elements in the list and check whether one of the longer phrases matches at the same position.
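A minimal sketch of that idea, not a definitive implementation. It uses a plain `List<string>` sorted with ordinal comparison rather than a `SortedList`, and the names `SortedTagMatcher`, `FindTags` and `MatchesAt` are made up for illustration. It assumes the text and the tags are already in the same case and that the words in a phrase are separated by single spaces. Preferring the longest entry at a position like this is often called longest-match (or "maximal munch") matching.

```csharp
using System;
using System.Collections.Generic;

public static class SortedTagMatcher
{
    // True if `tag` occurs at `pos` and is not followed by a letter or digit.
    private static bool MatchesAt(string text, int pos, string tag)
    {
        int end = pos + tag.Length;
        if (end > text.Length) return false;
        if (string.CompareOrdinal(text, pos, tag, 0, tag.Length) != 0) return false;
        return end == text.Length || !char.IsLetterOrDigit(text[end]);
    }

    // Returns (start index, tag) pairs, preferring the longest tag at each position.
    public static List<KeyValuePair<int, string>> FindTags(string text, IEnumerable<string> tags)
    {
        var sorted = new List<string>();
        foreach (string t in tags)
            if (!string.IsNullOrEmpty(t)) sorted.Add(t);
        // Ordinal order puts "rice" directly before "rice wine", "rice wine vinegar", ...
        sorted.Sort(StringComparer.Ordinal);

        var found = new List<KeyValuePair<int, string>>();
        int pos = 0;
        while (pos < text.Length)
        {
            string best = null;
            bool atWordStart = pos == 0 || !char.IsLetterOrDigit(text[pos - 1]);
            for (int i = 0; atWordStart && i < sorted.Count; i++)
            {
                if (!MatchesAt(text, pos, sorted[i])) continue;
                best = sorted[i];
                // Peek at the following entries that extend this one by more words.
                string prefix = sorted[i] + " ";
                for (int j = i + 1;
                     j < sorted.Count && sorted[j].StartsWith(prefix, StringComparison.Ordinal);
                     j++)
                {
                    if (sorted[j].Length > best.Length && MatchesAt(text, pos, sorted[j]))
                        best = sorted[j];
                }
                break;
            }

            if (best != null)
            {
                found.Add(new KeyValuePair<int, string>(pos, best));
                pos += best.Length; // skip the whole phrase so its words are not re-matched
            }
            else
            {
                pos++;
            }
        }
        return found;
    }
}
```

For example, calling `SortedTagMatcher.FindTags("2 cups rice and 1 tbsp rice wine vinegar", new[] { "wine", "rice", "rice wine", "rice wine vinegar" })` reports 'rice' at index 7 and 'rice wine vinegar' at index 23; the 'rice', 'rice wine' and 'wine' entries are not counted again inside the longer phrase. Each returned start index and tag length could then be fed to `SelectionStart` and `SelectionLength` on the RichTextBox.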

I refactored my lookup database table and made four columns for tags with one, two, three and four words, e.g. 'rice', 'rice wine', 'rice wine vinegar' and 'sour rice wine noodles'.

I used 4 dictionaries and loaded each dictionary with the corresponding column from the database lookup table.

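I searched my target string with the four-word dictionary first, then the three-word dictionary, then the two-word dictionary, and finally the one-word dictionary, so that longer phrases are matched before the single words they contain.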

I used Regex's whole-word boundary pattern "\b" + word.Key + "\b" to match each tag as a whole word, then replaced the match with a placeholder token.

Slow but it does the job for now.

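foreach (KeyValuePair<string, string> word in tagTwo)
{
    // Re-read the text on each pass so earlier replacements are taken into account.
    string input = richTextBox1.Text.ToLower();
    if (!input.Contains(word.Key))
        continue; // cheap substring check before running the regex

    // Escape the tag so characters such as '+' or '(' are matched literally.
    string pattern = @"\b" + Regex.Escape(word.Key) + @"\b";
    if (!Regex.IsMatch(input, pattern))
        continue;

    Console.WriteLine(pattern);

    // Replace the matched phrase with a placeholder so that shorter tags
    // (e.g. 'rice') cannot match inside it on a later pass.
    // Note: this overwrites the box with the lower-cased text.
    richTextBox1.Text = Regex.Replace(input, pattern, "[[token]]");

    // Record the tag with a parameterised insert instead of string concatenation.
    using (SQLiteCommand createSQL = new SQLiteCommand(
        "INSERT INTO ingredientCount (ingredientTag, tagCount) VALUES (@tag, 1);", conn))
    {
        createSQL.Parameters.AddWithValue("@tag", word.Key);
        createSQL.ExecuteNonQuery();
    }
}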
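If the four passes turn out to be too slow, the same longest-phrase-first ordering could probably be folded into a single regular expression. The sketch below is an alternative to the approach above, not what I actually ran: it sorts all tags by length, longest first, and joins them into one alternation. .NET tries alternatives from left to right, so at any position the longest tag that fits wins. `LongestFirstTagCounter` and `CountTags` are hypothetical names, and it assumes tags begin and end with word characters so that `\b` behaves as expected.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LongestFirstTagCounter
{
    // Counts whole-word occurrences of each tag, never counting a shorter tag
    // inside a longer one that matched at the same place.
    public static Dictionary<string, int> CountTags(string text, IEnumerable<string> tags)
    {
        var counts = new Dictionary<string, int>();

        // Longest tags first, each escaped so it is matched literally.
        List<string> alternatives = tags
            .Where(t => !string.IsNullOrEmpty(t))
            .OrderByDescending(t => t.Length)
            .Select(t => Regex.Escape(t))
            .ToList();
        if (alternatives.Count == 0)
            return counts;

        string pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";

        // Matches are non-overlapping, so text consumed by 'rice wine'
        // is not offered to 'rice' again.
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.IgnoreCase))
        {
            string key = m.Value.ToLowerInvariant();
            int current;
            counts.TryGetValue(key, out current);
            counts[key] = current + 1;
        }
        return counts;
    }
}
```

With the tags 'rice', 'rice wine' and 'rice wine vinegar', the text "Rice, rice wine and rice wine vinegar" gives a count of 1 for each tag. The counts could then be written to the ingredientCount table in one go instead of one insert per pass.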
