
C# Dictionary - ContainsKey Function Returns Wrong Value

I'm trying to use a Dictionary<string, int> to map some words (the int is not really relevant). After inserting the words into the dictionary (I checked that they are there), I go over the whole document and look for specific words.

When I do that, ContainsKey returns false even if the word exists in the dictionary.

What could the problem be, and how can I fix it?

public string RemoveStopWords(string originalDoc){
    string updatedDoc = "";
    string[] originalDocSeperated = originalDoc.Split(' ');
    foreach (string word in originalDocSeperated)
    {
        if (!stopWordsDic.ContainsKey(word))
        {
            updatedDoc += word;
            updatedDoc += " ";
        }
    }
    return updatedDoc.Substring(0, updatedDoc.Length - 1); //Remove Last Space
}

For example: the dictionary contains stop words such as "the". When I get the word "the" from originalDoc and check whether it is missing from the dictionary, execution still enters the if statement, even though both words are written exactly the same. It is not a case-sensitivity issue.

Dictionary<string, int> stopWordsDic = new Dictionary<string, int>();

string stopWordsContent = System.IO.File.ReadAllText(stopWordsPath);
string[] stopWordsSeperated = stopWordsContent.Split('\n');
foreach (string stopWord in stopWordsSeperated)
{
    stopWordsDic.Add(stopWord, 1);
}

The stop words file contains one word per line.


Thank you.

This is just a guess (and too long for a comment), but when you fill your Dictionary, you are splitting on \n.

So if the actual line separator in the text file you are using is \r\n, each inserted key is left with a trailing \r, and ContainsKey does not find the word.

So I'd start with string[] stopWordsSeperated = stopWordsContent.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None); and then trim each entry.
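To make the guess concrete, here is a minimal sketch of the failure and the fix. It assumes the file uses Windows line endings; the file content is inlined as a string literal in place of File.ReadAllText(stopWordsPath), and the class and variable names are only illustrative.

```csharp
using System;
using System.Collections.Generic;

class CarriageReturnDemo
{
    static void Main()
    {
        // Assumed sample content with \r\n line endings, standing in for the stop words file.
        string stopWordsContent = "the\r\na\r\nof\r\n";

        // Splitting on '\n' alone leaves a trailing '\r' on every key.
        var naive = new Dictionary<string, int>();
        foreach (string stopWord in stopWordsContent.Split('\n'))
        {
            naive[stopWord] = 1; // the indexer does not throw on a repeated key
        }
        Console.WriteLine(naive.ContainsKey("the"));   // False
        Console.WriteLine(naive.ContainsKey("the\r")); // True

        // Splitting on both separators and trimming gives clean keys.
        var cleaned = new Dictionary<string, int>();
        string[] separators = new string[] { "\r\n", "\n" };
        foreach (string stopWord in stopWordsContent.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            cleaned[stopWord.Trim()] = 1;
        }
        Console.WriteLine(cleaned.ContainsKey("the")); // True
    }
}
```

Inspecting a key's Length in the debugger (4 instead of 3 for "the") is a quick way to confirm whether this is what is happening in your file.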


As a side note, if you are not using the dictionary's int values for anything, you'd be better off using a HashSet<string> and Contains instead of ContainsKey.
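A rough sketch of what the HashSet<string> version could look like. The helper names LoadStopWords and RemoveStopWords, their parameters, and the inlined sample content are assumptions made for illustration; they are not taken from the question's actual class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class StopWordSetDemo
{
    // Builds a set of stop words from raw file content (one word per line).
    public static HashSet<string> LoadStopWords(string content)
    {
        var set = new HashSet<string>();
        string[] separators = new string[] { "\r\n", "\n" };
        foreach (string line in content.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            set.Add(line.Trim()); // Add returns false for a duplicate instead of throwing
        }
        return set;
    }

    // Keeps only the words that are not in the stop word set.
    public static string RemoveStopWords(string originalDoc, HashSet<string> stopWords)
    {
        return string.Join(" ", originalDoc.Split(' ').Where(w => !stopWords.Contains(w)));
    }

    static void Main()
    {
        HashSet<string> stopWords = LoadStopWords("the\r\na\r\nof\r\n");
        Console.WriteLine(RemoveStopWords("the king of a hill", stopWords)); // king hill
    }
}
```

Besides dropping the unused int, the set also tolerates a stop word that appears twice in the file, where Dictionary.Add would throw.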

You have a ! (not) operator in your if statement, so you're checking whether the dictionary does not contain the key. Remove the exclamation mark from the start of your condition.

When you create the dictionary, you would need to do the following:

var stopWords = new Dictionary<string, int>(
    StringComparer.InvariantCultureIgnoreCase);

The important part is StringComparer.InvariantCultureIgnoreCase, which makes key lookups ignore case.
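As a small illustration of what the comparer changes, the sketch below compares a default dictionary with one built using StringComparer.InvariantCultureIgnoreCase. The class name and sample keys are made up for the demo.

```csharp
using System;
using System.Collections.Generic;

class IgnoreCaseDemo
{
    static void Main()
    {
        // Default comparer: keys must match exactly, including case.
        var caseSensitive = new Dictionary<string, int>();
        caseSensitive["the"] = 1;

        // Comparer passed to the constructor: lookups ignore case.
        var ignoreCase = new Dictionary<string, int>(StringComparer.InvariantCultureIgnoreCase);
        ignoreCase["the"] = 1;

        Console.WriteLine(caseSensitive.ContainsKey("The")); // False
        Console.WriteLine(ignoreCase.ContainsKey("The"));    // True
    }
}
```

The comparer has to be supplied when the dictionary is constructed; it cannot be changed afterwards.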

public string RemoveStopWords(string originalDoc){
    return String.Join(" ", 
           originalDoc.Split(' ')
              .Where(x => !stopWordsDic.ContainsKey(x))
    );
}

Furthermore, you should change how you fill the dictionary (this strips all non-word symbols from each entry before it is added):

        // Regex to find the first word inside a string regardless of any
        // leading symbols. Everything after that word is cut away.
        // Requires: using System.Text.RegularExpressions;
        Regex validWords = new Regex(@"\b([0-9a-zA-Z]+?)\b");

        string stopWordsContent = System.IO.File.ReadAllText(stopWordsPath);
        string[] stopWordsSeperated = stopWordsContent.Split('\n');

        foreach (string stopWord in stopWordsSeperated)
        {
            stopWordsDic.Add(validWords.Match(stopWord).Value, 1);
        }
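To show what that pattern extracts, here is a short sketch that runs it against a few made-up inputs, including a line that still carries a trailing \r. The sample strings and class name are assumptions for the demo.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexCleanupDemo
{
    static void Main()
    {
        Regex validWords = new Regex(@"\b([0-9a-zA-Z]+?)\b");

        // A trailing carriage return is not part of the match.
        Console.WriteLine(validWords.Match("the\r").Value == "the"); // True

        // Leading spaces and trailing punctuation are skipped as well.
        Console.WriteLine(validWords.Match("  of, ").Value == "of"); // True

        // An empty line produces no match, and Value is then an empty string.
        Console.WriteLine(validWords.Match("").Success);             // False
    }
}
```

One thing to watch: because a failed match yields an empty string, a file with more than one blank line would try to add the key "" twice, and Dictionary.Add throws on a duplicate key. Checking Match.Success before adding avoids that.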

I see that you're setting 1 as the value for all entries. Maybe a List would better fit your needs:

List<string> stopWordsDic = new List<string>();

string stopWordsContent = System.IO.File.ReadAllText(stopWordsPath);
string[] stopWordsSeperated = stopWordsContent.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
foreach (string stopWord in stopWordsSeperated)
{
    stopWordsDic.Add(stopWord);
}

Then check for an element with Contains():

public string RemoveStopWords(string originalDoc){
    string updatedDoc = "";
    string[] originalDocSeperated = originalDoc.Split(' ');
    foreach (string word in originalDocSeperated)
    {
        if (!stopWordsDic.Contains(word))
        {
            // Append the kept word followed by a separating space
            updatedDoc += string.Format("{0}{1}", word, " ");
        }
    }
    // Remove the last space; guard against an empty result so Substring does not throw
    return updatedDoc.Length > 0 ? updatedDoc.Substring(0, updatedDoc.Length - 1) : updatedDoc;
}

