Best way to take the intersection of more than two HashSets in C# when the number of sets is not known beforehand

I am building a boolean retrieval system for a large number of documents. I have a dictionary of hash sets: the keys are the terms, and each hash set contains the IDs of the documents in which that term was found. To search for a single word, I simply look up the entered word in the dictionary and print the corresponding hash set. I also want to search for sentences. In that case I split the query into individual words and look up each word in the dictionary, so the number of hash sets returned depends on the number of words in the query. I then want to take the intersection of these hash sets so that I can return the IDs of the documents that contain the words in the query. My question is: what is the best way to take the intersection of these hash sets?
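For readers who want to reproduce the setup, here is a minimal sketch of how such a term-to-document-IDs dictionary could be filled. The question does not show this step, so the `IndexSketch` class, the `BuildIndex` method, and the input shape (document ID mapped to document text) are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;

static class IndexSketch
{
    // Hypothetical helper: builds term -> set of document IDs.
    // The input shape (document ID -> text) is an assumption, not taken from the question.
    public static Dictionary<string, HashSet<string>> BuildIndex(
        IEnumerable<KeyValuePair<string, string>> documents)
    {
        // Same separators the question uses when splitting the query.
        var separators = new[] { ' ', ',', '.', ':', '?', '!', '\t' };
        var index = new Dictionary<string, HashSet<string>>();

        foreach (var doc in documents)
        {
            string[] terms = doc.Value.Split(separators, StringSplitOptions.RemoveEmptyEntries);
            foreach (string term in terms)
            {
                // Create the set for a term the first time the term is seen.
                if (!index.TryGetValue(term, out HashSet<string> docIds))
                {
                    docIds = new HashSet<string>();
                    index[term] = docIds;
                }
                docIds.Add(doc.Key);
            }
        }
        return index;
    }
}
```

With this sketch, `dt = IndexSketch.BuildIndex(docs)` would give the `dt` dictionary that the code in the question assumes is already filled.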

Currently I put the hash sets into a list and intersect them two at a time: I take the intersection of the first two, then intersect that result with the third one, and so on.

This is the code:

Dictionary<string, HashSet<string>> dt = new Dictionary<string, HashSet<string>>();//assume it is filled with data...

while (true)
{
    Console.WriteLine("\n\n\nEnter the query you want to search");
    string inp = Console.ReadLine();
    string[] words = inp.Split(new Char[] { ' ', ',', '.', ':', '?', '!', '\t' });

    List<HashSet<string>> outparr = new List<HashSet<string>>();
    foreach (string w in words)
    {
        HashSet<string> outp = new HashSet<string>();
        if (dt.TryGetValue(w, out outp))
        {
            outparr.Add(outp);
            Console.WriteLine("Found {0} documents.", outp.Count);
            foreach (string s in outp)
            {
                Console.WriteLine(s);
            }
        }
    }

    HashSet<string> temp = outparr.First();
    foreach (HashSet<string> hs in outparr)
    {
        temp = new HashSet<string>(temp.Intersect(hs));
    }

    Console.WriteLine("Output After Intersection:");
    Console.WriteLine("Found {0} documents: ", temp.Count);
    foreach (string s in temp)
    {
        Console.WriteLine(s);
    }
}

The principle you are using is sound, but you can tweak it a bit.

If you sort the hash sets by size, you can start with the smallest one and so minimise the number of comparisons: the running result can never be larger than the smallest set.

Instead of using the IEnumerable<>.Intersect method, you can do the same thing in a loop that takes advantage of the fact that you already have hash sets. Checking whether a value exists in a hash set is very fast, so you can just loop through the items in the smallest set, look for matching values in the next set, and put the matches in a new set.

In the loop you can skip the first set, because you start with it. You don't need to intersect it with itself.

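// Requires: using System.Linq;
// Process the smallest set first so every later pass has as few candidates as possible.
outparr = outparr.OrderBy(o => o.Count).ToList();

// Note: outparr[0] throws if no query word was found, so check outparr.Count first.
HashSet<string> combined = outparr[0];
foreach (HashSet<string> hs in outparr.Skip(1)) {
  HashSet<string> temp = new HashSet<string>();
  foreach (string s in combined) {
    // Contains is a hash lookup, so each pass costs about combined.Count lookups.
    if (hs.Contains(s)) {
      temp.Add(s);
    }
  }
  combined = temp;
}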

IntersectWith is a good approach. Like this:

HashSet<string> res = null;
HashSet<string> outdictinary = null;
foreach (string w in words)
{
    if (dt.TryGetValue(w, out outdictinary))
    {
        if (res == null)
        {
            // Copy the first set so that IntersectWith does not modify the set stored in dt.
            res = new HashSet<string>(outdictinary, outdictinary.Comparer);
        }
        else
        {
            // An empty intersection cannot grow again, so stop early.
            if (res.Count == 0)
                break;
            res.IntersectWith(outdictinary);
        }
    }
}
if (res == null) res = new HashSet<string>();
Console.WriteLine("Output After Intersection:");
Console.WriteLine("Found {0} documents: ", res.Count);
foreach (string s in res)
{
    Console.WriteLine(s);
}
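The ideas from both answers (start from the smallest set, intersect in place with IntersectWith, stop once the result is empty) can be combined in one reusable helper. This is only a sketch: the `SetUtil` class and the `IntersectAll` method are hypothetical names, not part of the framework or of either answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SetUtil
{
    // Hypothetical helper: intersects any number of sets, smallest first.
    // Returns an empty set when no sets are given.
    public static HashSet<string> IntersectAll(IEnumerable<HashSet<string>> sets)
    {
        List<HashSet<string>> ordered = sets.OrderBy(s => s.Count).ToList();
        if (ordered.Count == 0)
            return new HashSet<string>();

        // Copy the smallest set so the sets stored in the dictionary are never modified.
        var result = new HashSet<string>(ordered[0], ordered[0].Comparer);
        foreach (HashSet<string> other in ordered.Skip(1))
        {
            // An empty intersection cannot grow again, so stop early.
            if (result.Count == 0)
                break;
            result.IntersectWith(other);
        }
        return result;
    }
}
```

Assuming the list from the question, the call would be `HashSet<string> hits = SetUtil.IntersectAll(outparr);`. Unlike `outparr.First()` or `outparr[0]`, this does not throw when none of the query words is in the dictionary.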

To answer your question, it's possible that at some point you'll have one set of documents that contain the words a, b and c, and another set of documents that contain only other words from your query, so the intersection can become empty after a few iterations. You can check for this and break out of the foreach.

Now, IMHO it doesn't make sense to do that intersection, because usually a search result should contain multiple files ordered by relevance. It will also be much easier, because you already have a list of the files containing each word. From the hash sets obtained for each word, you count the occurrences of each file ID and return a limited number of IDs, ordered descending by the number of occurrences.
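A minimal sketch of that counting approach might look like the following. The `RankingSketch` class, the `RankByMatchCount` method, and the `limit` parameter are hypothetical names chosen for illustration, and ties are broken by document ID only to keep the output order deterministic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class RankingSketch
{
    // Hypothetical helper: counts in how many of the per-word sets each document ID
    // appears and returns the top `limit` IDs, highest count first.
    public static List<KeyValuePair<string, int>> RankByMatchCount(
        IEnumerable<HashSet<string>> sets, int limit)
    {
        var counts = new Dictionary<string, int>();
        foreach (HashSet<string> set in sets)
        {
            foreach (string docId in set)
            {
                // seen is left at 0 when docId has not been counted yet.
                counts.TryGetValue(docId, out int seen);
                counts[docId] = seen + 1;
            }
        }

        return counts
            .OrderByDescending(pair => pair.Value)
            .ThenBy(pair => pair.Key, StringComparer.Ordinal)
            .Take(limit)
            .ToList();
    }
}
```

With the list from the question, `RankingSketch.RankByMatchCount(outparr, 10)` would return up to ten document IDs, with the documents matching the most query words first. A document that matches every word has a count equal to `outparr.Count`, so the strict intersection is still recoverable from the result.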
