
Add input data on the fly to Hadoop Map-Reduce Job?

Can I append input files or input data to a map-reduce job at runtime without creating a race condition?

I think in theory you can add more files to the input, as long as the addition:

  1. Matches your FileInputFormat pattern
  2. Happens before the InputFormat.getSplits() call, which really leaves you only a very short window after you submit the job (see the sketch below).
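
To make point 2 concrete, here is a minimal sketch (newer org.apache.hadoop.mapreduce API; the paths and class name are illustrative) of where that window sits: the client enumerates the splits while the job is being submitted, so files added to the input directory after submit() returns are not part of this job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LateInputDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "late-input-demo");
        job.setJarByClass(LateInputDemo.class);
        job.setInputFormatClass(TextInputFormat.class);

        // Only files that match this path at submission time are considered;
        // TextInputFormat.getSplits() enumerates them while submit() runs.
        FileInputFormat.addInputPath(job, new Path("/data/incoming"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));

        job.submit();  // the split list is frozen at this point
        // Copying new files into /data/incoming from here on has no effect on
        // this job; they would only be seen by a subsequent job.
        job.waitForCompletion(true);
    }
}
```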

Regarding the race condition after splits are computed, note that appending to existing files has only been available since version 0.21.0.
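
For reference, this is roughly what such an append looks like through the FileSystem API (a sketch only; it assumes an HDFS build with append support enabled, and the path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Append one extra record to an existing input file.
        try (FSDataOutputStream out = fs.append(new Path("/data/incoming/part-00000"))) {
            out.write("late record\n".getBytes("UTF-8"));
        }
        // Even if the append succeeds, a job whose splits were already computed
        // will generally not see the extra bytes.
        fs.close();
    }
}
```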

And even if you can modify your files, your split points are already precomputed, and most likely your new data will not be picked up by the mappers. Though, I doubt that it will lead to a crash of your flow.

What you can experiment with is disabling splits within a file (that is, assigning one mapper per file) and then trying to append. I think some data that had a chance to get flushed may end up in a mapper (that's just my wild guess).
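
One way to run that experiment is to make the input format report every file as non-splittable, so each file maps to exactly one split and one mapper; a minimal sketch (the class name is illustrative):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Report every input file as a single split, i.e. one mapper per file.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // never split a file, regardless of its size
    }
}
```

Register it with job.setInputFormatClass(NonSplittableTextInputFormat.class). Whether a mapper then sees appended data still depends on when the task actually opens the file, so treat this strictly as an experiment.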

Effectively the answer is "no". The splits are computed very early in the game, and after that your new files will not be included.
