This looks like a job for MapReduce… but I just can't figure it out

I've been battling with this for about 2 days now, and any help would be tremendously appreciated. I currently have a very large MongoDB collection (over 100M documents) in the following format:

[_id]
[date]
[score]
[meta1]
[text1]
[text2]
[text3]
[text4]
[meta2]

This isn't the exact data in there; I've obfuscated it a little for the purposes of this post, but the schema is identical. And no, the format of that data cannot be changed; that's just the way it is.

There are a TON of duplicate entries in there: a job runs once a day, adding millions of entries to the database that may have the same data in the text fields but different values for the score, meta1, and meta2 fields. So I need to eliminate the duplicates and shoehorn everything into one collection with no duplicate texts:

First, I'm going to concatenate the text fields and hash the result, so I have no duplicates containing the same text fields (this part is easy and already works).
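
That step looks roughly like this (a minimal Node.js sketch; the md5 choice and the null-byte separator are incidental — any stable hash and any unambiguous join would do):

var crypto = require('crypto');

// Join with a separator so e.g. ("ab","c") and ("a","bc") don't
// collide, then hash the result to get a stable dedup key.
function textHash(doc) {
    var joined = [doc.text1, doc.text2, doc.text3, doc.text4].join('\u0000');
    return crypto.createHash('md5').update(joined).digest('hex');
}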

Here's where I'm struggling: the resulting collection will have an array of each unique meta1, which will in turn be an array containing the dates and scores matching it.

So if I have the following three documents in my collection now:

[_id] => random mongoid
[date] => 12092010 
[score] => 3
[meta1] => somemetadatahere
[text1] => foo
[text2] => bar
[text3] => foo2
[text4] => bar2
[meta2] => uniquemeta2data

[_id] => random mongoid
[date] => 12092010
[score] => 5
[meta1] => othermetadata
[text1] => foo
[text2] => bar
[text3] => foo2
[text4] => bar2
[meta2] => uniquemeta2data1

[_id] => random mongoid
[date] => 12102010
[score] => 7
[meta1] => somemetadatahere  (same meta1 as the first document)
[text1] => foo
[text2] => bar
[text3] => foo2
[text4] => bar2
[meta2] => uniquemeta2data

They should be reduced to this collection (indents are nested documents/arrays). The keys in the datas array come from the values of the meta1 field in the original collection:

[_id]=> (md5 hash of all the text fields)
[text1] => foo
[text2] => bar
[text3] => foo2
[text4] => bar2    
[datas]
    [somemetadatahere]
        [meta2] => uniquemeta2data
        [scores]
            [12092010]=>3
            [12102010]=>7
    [othermetadata]
        [meta2] => uniquemeta2data1   
        [scores]
            [12092010]=>5

This seems like a perfect use case for a MapReduce job, but I'm having trouble wrapping my head around exactly how to do this.

Is anyone up for the challenge of helping me figure this out?

Basically, this is the same problem as the well-known word-frequency problem in MapReduce, but instead of using words, you use hashes (and a reference to the original entry):

  • Map: Take the hash of each entry and map it onto the pair (hash, 1). (To retrieve the original entry later, create an object and keep the original entry as a property.)
  • Reduce: All identical hashes are collected into the same bucket; count the values for each pair (hash, 1).
  • Output the hashes, the original entry (stored in the object), and the count.

Analogy: the cat sat on the mat

Map:

  • the -> (hash(the), 1)
  • cat -> (hash(cat), 1)
  • sat -> (hash(sat), 1)
  • on -> (hash(on), 1)
  • the -> (hash(the), 1)
  • mat -> (hash(mat), 1)

Intermediate:

  • the -> (hash(the), 1)
  • cat -> (hash(cat), 1)
  • sat -> (hash(sat), 1)
  • on -> (hash(on), 1)
  • the -> (hash(the), 1)
  • mat -> (hash(mat), 1)

Reduce:

  • (hash(the), 2)
  • (hash(cat), 1)
  • (hash(sat), 1)
  • (hash(on), 1)
  • (hash(mat), 1)
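
Applied to your schema, an (untested) MongoDB mapReduce sketch might look like the following. Here entries and deduped are placeholder collection names, and I key on the joined text fields directly, though your existing md5 hash would work just as well. Note that mapReduce wraps each reduced document under a value field in the output collection:

var map = function() {
    // One key per unique combination of text fields; the md5 hash
    // of the concatenation from the question would do equally well.
    var key = [this.text1, this.text2, this.text3, this.text4].join('|');

    var entry = {
        text1: this.text1,
        text2: this.text2,
        text3: this.text3,
        text4: this.text4,
        datas: {}
    };
    entry.datas[this.meta1] = { meta2: this.meta2, scores: {} };
    entry.datas[this.meta1].scores[this.date] = this.score;

    emit(key, entry);
};

var reduce = function(key, values) {
    // Merge the datas maps. Reduce may run repeatedly on partial
    // results, so it only ever combines values of the same shape.
    var merged = values[0];
    for (var i = 1; i < values.length; i++) {
        var datas = values[i].datas;
        for (var m in datas) {
            if (!merged.datas[m]) {
                merged.datas[m] = datas[m];
            } else {
                for (var d in datas[m].scores) {
                    merged.datas[m].scores[d] = datas[m].scores[d];
                }
            }
        }
    }
    return merged;
};

db.entries.mapReduce(map, reduce, { out: "deduped" });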

I think the MapReduce problem seems straightforward, which means I probably misunderstand your problem. Here is how I see it anyway.

Divide up the original collection based on the text hash. Have each section focus on combining the resulting subset.

Here's some code from http://www.dashdashverbose.com/2009/01/mapreduce-with-javascript.html

I will try to edit this to fit your question.

function myMapper(key, value) {
    // Split the input text into normalized words and emit a
    // (word, 1) pair for each occurrence.
    var ret = [];
    var words = normalizeText(value).split(' ');
    for (var i = 0; i < words.length; i++) {
        ret.push({key: words[i], value: 1});
    }
    return ret;
}

function myReducer(intermediateKey, values) {
    // Sum the 1s collected for each word to get its total count.
    var sum = 0;
    for (var i = 0; i < values.length; i++) {
        sum += values[i];
    }
    return {key: intermediateKey, value: sum};
}

function normalizeText(s) {
    // Lowercase, collapse runs of non-letters into single spaces,
    // and trim so split(' ') does not produce empty words.
    s = s.toLowerCase();
    s = s.replace(/[^a-z]+/g, ' ');
    return s.replace(/^ +| +$/g, '');
}
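
The mapReduce driver itself lives in the linked post; a minimal stand-in (my sketch, not the original) that makes the snippet below runnable end-to-end could be:

function mapReduce(input, mapper, reducer) {
    // Group every (key, value) pair emitted by the mapper under its
    // intermediate key, then reduce each group to a single result.
    var intermediate = {};
    for (var key in input) {
        var pairs = mapper(key, input[key]);
        for (var i = 0; i < pairs.length; i++) {
            if (!intermediate[pairs[i].key]) {
                intermediate[pairs[i].key] = [];
            }
            intermediate[pairs[i].key].push(pairs[i].value);
        }
    }
    var output = [];
    for (var ikey in intermediate) {
        output.push(reducer(ikey, intermediate[ikey]));
    }
    return output;
}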

var i = {};
i.atxt = "The quick brown fox jumped over the lazy grey dogs.";
i.btxt = "That's one small step for a man, one giant leap for mankind.";
i.ctxt = "Mary had a little lamb, Its fleece was white as snow; And everywhere that Mary went, The lamb was sure to go.";

var out = mapReduce(i, myMapper, myReducer);
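
With that stand-in, out ends up as an array of {key, value} pairs, one per distinct normalized word across the three strings, e.g. {key: "the", value: 3}.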
