
Heuristic for finding elements that often appear together in a big data set

Problem:

I have a list of millions of transactions. Each transaction contains items (e.g. 'carrots', 'apples'). The goal is to generate a list of pairs of items that frequently occur together in individual transactions. As far as I can tell, doing an exhaustive search isn't feasible.

Solution attempts:

So far I have two ideas: 1) randomly sample some appropriate fraction of transactions and only check those, or 2) count how often each element appears, use that data to calculate how often elements should appear together by chance, and use that to adjust the estimate from 1.
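Idea 1 can be sketched in a few lines. This is a minimal illustration, assuming transactions are iterables of item strings; the function name and sample size are placeholders, not anything from the question:

```python
import random
from collections import Counter
from itertools import combinations

def sample_pair_counts(transactions, sample_size, seed=0):
    """Estimate pair frequencies from a random sample of transactions."""
    rng = random.Random(seed)
    sample = rng.sample(transactions, min(sample_size, len(transactions)))
    counts = Counter()
    for items in sample:
        # Count each unordered pair at most once per transaction.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts
```

Scaling the sampled counts by `len(transactions) / sample_size` would then give a rough estimate of the true pair frequencies.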

Any tips, alternative approaches, ready-made solutions or just general reading suggestions are much appreciated.

Edit:

Some additional information from the comments:

Number of different items: 1,000 to 100,000

Memory constraint: a few gigs of RAM at most, for a few hours.

Frequency of use: more or less a one-off.

Available resources: 20-100 hours of newbie programmer time.

Desired result list format: pairs of items and some measure of how often they appear, for the n most frequent pairs.

Distribution of items per transaction: unknown as of now.

Let the number of transactions be n, the number of items be k, and the average size of a transaction be d.

The naive approach (checking every pair against all records) gives an O(k^2 * n * d) solution, which is far from optimal. But we can improve it to O(k*n*d), and if we assume a uniform distribution of items (i.e. each item repeats on average O(n*d/k) times), we might be able to improve it to O(d^2 * n + k^2) (which is much better, since most likely d << k).

This can be done by building an inverted index of your transactions, meaning: create a map from each item to the transactions containing it (creating the index is O(nd + k)).

For example, if you have the transactions

transaction1 = ('apple','grape')
transaction2 = ('apple','banana','mango')
transaction3 = ('grape','mango')

The inverted index will be:

'apple' -> [1,2]
'grape' -> [1,3]
'banana' -> [2]
'mango' -> [2,3]
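Building this index is a single pass over the data. A minimal sketch, with transaction IDs taken to be 1-based list positions as in the example above:

```python
from collections import defaultdict

def build_inverted_index(transactions):
    """Map each item to the list of IDs of transactions containing it."""
    index = defaultdict(list)
    for tid, items in enumerate(transactions, start=1):
        # Deduplicate within a transaction so each ID appears at most once.
        for item in set(items):
            index[item].append(tid)
    return index
```

Because transactions are visited in order, each posting list comes out sorted by transaction ID for free.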

So, after understanding what an inverted index is, here are the guidelines for the solution:

  1. Build an inverted index for your data.
  2. For each item x, iterate over all documents it appears in, and build a histogram of all pairs (x,y) such that y co-occurs with x.
  3. When you are done, you have a histogram containing up to k^2 entries, which you need to process. This question discusses how to get the top-k elements out of an unsorted list.
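The three steps above can be sketched as follows. This is one possible implementation under the stated assumptions, not the answerer's own code; `heapq.nlargest` handles the top-X extraction of step 3 without fully sorting the histogram:

```python
import heapq
from collections import Counter, defaultdict

def top_pairs(transactions, top_x):
    # Step 1: inverted index from item to the transactions containing it.
    index = defaultdict(list)
    for tid, items in enumerate(transactions):
        for item in set(items):
            index[item].append(tid)
    # Step 2: histogram of co-occurring pairs.
    histogram = Counter()
    for x, tids in index.items():
        for tid in tids:
            for y in set(transactions[tid]):
                if x < y:  # count each unordered pair (x,y) only once
                    histogram[(x, y)] += 1
    # Step 3: top-X pairs by count, without sorting the whole histogram.
    return heapq.nlargest(top_x, histogram.items(), key=lambda kv: kv[1])
```

The `x < y` check ensures each unordered pair is credited to exactly one of its two items, so nothing is double-counted.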

Complexity analysis:

  1. Building the inverted index is O(nd+k).
  2. Assuming each element repeats in O(nd/k) transactions, each iteration takes O(nd/k * d) time, and you have k iterations in this step, so you get O(nd^2 + k) for this step.
  3. Processing the list can be done in O(k^2 log k) if you want a full ordering, or in O(k^2) if you just want to print the top X elements.

Totaling an O(nd^2 + k^2) solution to get the top-X elements, which is MUCH better than the naive approach, assuming d << k.

In addition, note that the bottleneck (step 2) can be efficiently parallelized and distributed among threads if needed.

If the number of items ordered in one purchase is small (<10), do this:
keep a map of maps (a dictionary of dictionaries): the key in the first map is an item,
and its value is a second map whose keys are the other items and whose values count how many times each appeared in a purchase together with the first key.

So go through every order and update the map for every pair. At the end, go through the map of maps and look for "big values" among the inner counts.
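A minimal sketch of this map-of-maps counting; storing each pair under its lexicographically smaller item (an implementation choice not stated in the answer) halves the memory and avoids double-counting:

```python
from collections import defaultdict
from itertools import combinations

def count_pairs(orders):
    """pair_counts[a][b] = number of orders containing both a and b (a < b)."""
    pair_counts = defaultdict(lambda: defaultdict(int))
    for order in orders:
        # Each unordered pair is counted at most once per order.
        for first, second in combinations(sorted(set(order)), 2):
            pair_counts[first][second] += 1
    return pair_counts
```

Scanning the inner dictionaries for the largest counts then yields the most frequent pairs.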

Note: depending on the size and "distribution" of the input data, you might end up without enough memory.
