Fastest way to distinct count a group based on another variable in the same line (distinct count of visitors to a page)
I have a file containing two columns: visitorId and pageID. I would like to find the number of unique (distinct) visitors for every page. I am using a HashTable inside a HashTable (a Dictionary of Dictionaries) to track whether a specific visitor has already been counted for a specific page. The file contains more than 1 billion lines, so performance is critical. Is there any data structure for counting distinct visitors other than a HashTable inside a HashTable?

I have to solve this problem on files, so importing into a database is not an option. The development environment is .NET and the language is C#.

You can find the code below:
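    // pageID -> set of visitors already counted for that page
    Dictionary<int, Dictionary<int, bool>> dicVisitorCount = new Dictionary<int, Dictionary<int, bool>>();
    // pageID -> number of distinct visitors seen so far
    Dictionary<int, int> dicPages = new Dictionary<int, int>();
    // Random data stands in for the file contents in this test
    Random r = new Random();
    int pageID, visitorID;
    int million = 1000000;
    for (int i = 0; i < 10 * million; i++)
    {
        pageID = r.Next(1, 100000);
        visitorID = r.Next(1, 1000000);
        if (!dicPages.ContainsKey(pageID))
        {
            // First time this page is seen: count its first visitor
            dicPages.Add(pageID, 1);
            Dictionary<int, bool> dicVisitors = new Dictionary<int, bool>();
            dicVisitors.Add(visitorID, true);
            dicVisitorCount.Add(pageID, dicVisitors);
        }
        else
        {
            // Known page: count the visitor only if not seen on this page before
            if (!dicVisitorCount[pageID].ContainsKey(visitorID))
            {
                dicVisitorCount[pageID].Add(visitorID, true);
                dicPages[pageID]++;
            }
        }
    }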
As a minor issue, I'd prefer a Dictionary of int to HashSet as opposed to a Dictionary of int to Dictionary (the mapping functionality of a Dictionary is unnecessary here).
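A minimal sketch of that suggestion follows. The class name DistinctVisitorCounter, the Count method, and the tuple-based input are assumptions made for illustration; reading and parsing the file is left out. HashSet<int>.Add does nothing when the value is already present, so no separate ContainsKey check is needed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: exact distinct visitor counts per page.
public static class DistinctVisitorCounter
{
    public static Dictionary<int, int> Count(IEnumerable<(int VisitorId, int PageId)> rows)
    {
        // pageId -> set of visitor IDs seen on that page
        var visitorsByPage = new Dictionary<int, HashSet<int>>();

        foreach (var (visitorId, pageId) in rows)
        {
            // A single lookup per row instead of ContainsKey followed by the indexer
            if (!visitorsByPage.TryGetValue(pageId, out var visitors))
            {
                visitors = new HashSet<int>();
                visitorsByPage.Add(pageId, visitors);
            }

            // Add is a no-op when the visitor is already in the set
            visitors.Add(visitorId);
        }

        // The distinct count for a page is simply the size of its set
        var counts = new Dictionary<int, int>(visitorsByPage.Count);
        foreach (var pair in visitorsByPage)
        {
            counts[pair.Key] = pair.Value.Count;
        }
        return counts;
    }
}
```

Because the count is read from HashSet<int>.Count, the separate page counter dictionary from the question is no longer needed. Memory use still grows with the number of distinct (page, visitor) pairs.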
If you don't care about an exact result, a Dictionary of int to Bloom filter could also be a consideration (with a separate count to keep track of how many elements are in each of the Bloom filters).
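The Bloom filter idea could look roughly like the sketch below. Everything here is an assumption for illustration: the IntBloomFilter class, the hash mixing function, and the default values for bitsPerPage and hashCount are not from any library and would need tuning to the real data. A visitor is counted only when adding it sets at least one new bit, so false positives can make the result slightly too low but never too high.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical minimal Bloom filter over int keys.
public sealed class IntBloomFilter
{
    private readonly BitArray bits;
    private readonly int hashCount;

    public IntBloomFilter(int bitCount, int hashCount)
    {
        bits = new BitArray(bitCount);
        this.hashCount = hashCount;
    }

    // Sets the bits for the key. Returns true if at least one bit was newly set,
    // which means the key was definitely not added before.
    public bool Add(int key)
    {
        unchecked
        {
            uint h1 = Mix((uint)key);
            uint h2 = Mix(h1 ^ 0x9E3779B9u) | 1u; // second hash, forced odd
            bool isNew = false;
            for (int i = 0; i < hashCount; i++)
            {
                // Double hashing: derive the i-th probe position from h1 and h2
                int index = (int)((h1 + (uint)i * h2) % (uint)bits.Length);
                if (!bits[index])
                {
                    bits[index] = true;
                    isNew = true;
                }
            }
            return isNew;
        }
    }

    // Simple integer bit mixer (illustrative choice, not tuned).
    private static uint Mix(uint x)
    {
        unchecked
        {
            x ^= x >> 16; x *= 0x85EBCA6Bu;
            x ^= x >> 13; x *= 0xC2B2AE35u;
            x ^= x >> 16;
            return x;
        }
    }
}

// Hypothetical helper: approximate distinct visitor counts per page.
public static class ApproxDistinctVisitorCounter
{
    public static Dictionary<int, int> Count(
        IEnumerable<(int VisitorId, int PageId)> rows,
        int bitsPerPage = 1 << 16,
        int hashCount = 4)
    {
        var filters = new Dictionary<int, IntBloomFilter>();
        var counts = new Dictionary<int, int>(); // separate count per filter

        foreach (var (visitorId, pageId) in rows)
        {
            if (!filters.TryGetValue(pageId, out var filter))
            {
                filter = new IntBloomFilter(bitsPerPage, hashCount);
                filters.Add(pageId, filter);
                counts.Add(pageId, 0);
            }

            // Count the visitor only if the filter had not seen it yet
            if (filter.Add(visitorId))
            {
                counts[pageId]++;
            }
        }
        return counts;
    }
}
```

A fixed-size filter per page only saves memory for pages with many visitors; for pages with a handful of visitors a HashSet is smaller, so the filter size is the main thing to tune.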