
Add input data on the fly to Hadoop Map-Reduce Job?

Can I append input files or input data to a map-reduce job at runtime without creating a race condition?

I think in theory you can add more files to the input, as long as the addition:

  1. Matches your FileInputFormat pattern
  2. Happens before the InputFormat.getSplits() call, which really leaves you only a very short window after you submit the job (see the sketch below).
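
To make point 2 concrete, here is a minimal sketch (newer org.apache.hadoop.mapreduce API; the paths and class name are illustrative) of where that window sits: the client enumerates the splits while the job is being submitted, so files added to the input directory after submit() returns are not part of this job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LateInputDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "late-input-demo");
        job.setJarByClass(LateInputDemo.class);
        job.setInputFormatClass(TextInputFormat.class);

        // Only files that match this path at submission time are considered;
        // TextInputFormat.getSplits() enumerates them while submit() runs.
        FileInputFormat.addInputPath(job, new Path("/data/incoming"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));

        job.submit();  // the split list is frozen at this point
        // Copying new files into /data/incoming from here on has no effect on
        // this job; they would only be seen by a subsequent job.
        job.waitForCompletion(true);
    }
}
```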

Regarding the race condition after splits are computed, note that appending to existing files has only been available since version 0.21.0.
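
For reference, this is roughly what such an append looks like through the FileSystem API (a sketch only; it assumes an HDFS build with append support enabled, and the path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Append one extra record to an existing input file.
        try (FSDataOutputStream out = fs.append(new Path("/data/incoming/part-00000"))) {
            out.write("late record\n".getBytes("UTF-8"));
        }
        // Even if the append succeeds, a job whose splits were already computed
        // will generally not see the extra bytes.
        fs.close();
    }
}
```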

And even if you can modify your files, your split points are already precomputed, and most likely your new data will not be picked up by the mappers. Though, I doubt that it will lead to a crash of your flow.

What you can experiment with is disabling splits within a file (that is, assigning one mapper per file) and then trying to append. I think some data that had a chance to get flushed may end up in a mapper (that's just my wild guess).
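
One way to run that experiment is to make the input format report every file as non-splittable, so each file maps to exactly one split and one mapper; a minimal sketch (the class name is illustrative):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Report every input file as a single split, i.e. one mapper per file.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // never split a file, regardless of its size
    }
}
```

Register it with job.setInputFormatClass(NonSplittableTextInputFormat.class). Whether a mapper then sees appended data still depends on when the task actually opens the file, so treat this strictly as an experiment.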

Effectively the answer is "no". The splits are computed very early in the game, and after that your new files will not be included.
