
GAE mapreduce: define parameters for job

I am fiddling around with GAE mapreduce and have one question:

Is it possible to change a variable only for a certain job in mapreduce?

The reason I am asking is:

The input csv and output csv of my mapreduce job are supposed to have the same header row. However, the header row ends up somewhere in the middle of the output csv, never at the top. To get the header row in the right place, I inserted a counter into my reduce function that checks the current iteration of the reduce job; if it is 0, the function passes the hard-coded header row to the pipeline. The counter gets reset when the output csv is stored in the blobstore.

The problem: more often than not the counter resets itself at seemingly random points, probably because I had to define it as a global variable ("reduce_counter = 0") outside of the function.
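For illustration, the pattern looks roughly like this (a sketch of my approach; the column names are placeholders):

reduce_counter = 0  # module-level counter shared by all reduce calls in a process

HEADER_ROW = "col_a,col_b,col_c\n"  # hard-coded header (placeholder columns)

def reduce_fn(key, values):
    global reduce_counter
    if reduce_counter == 0:
        # Emit the header before the first row -- but "first" only holds
        # within a single shard/instance, so the header can land anywhere.
        yield HEADER_ROW
    reduce_counter += 1
    yield ",".join([key] + list(values)) + "\n"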

Is there any method to attach a variable/parameter to a specific job, or is there any better way to get the header_row to the top?

I don't think that I can work with DictReader or the csv module, as the output is stored in the blobstore, and as far as I know blobstore objects cannot be altered.

You can find my code in the main.py file at www.github.com/jvdheyden/ste.

Thanks!

You should add the header after the mapreduce job has finished. You can do this by accessing the output CSV once the job is done:

import cloudstorage as gcs

# Open the mapreduce output for reading and a new file for writing.
orig_file = gcs.open(filename_from_mapreduce)
new_file = gcs.open(filename_from_mapreduce + "_with_headers", "w",
                    content_type="text/csv")

# Write the header row first, then copy the original output in chunks.
new_file.write("your,csv,headers,here\n")
while True:
    chunk = orig_file.read(1024 * 1024)
    if not chunk:
        break
    new_file.write(chunk)

orig_file.close()
new_file.close()
gcs.delete(filename_from_mapreduce)  # gcs.delete() takes a filename, not a file object

Your problem happens because GAE processes mapreduce tasks in multiple shards. The beauty is that each of those small tasks executes in parallel, which gives a huge time advantage on large amounts of data.

This also explains why your CSV header ends up in a random place: each shard simply writes its output whenever it finishes its part of the job, so you cannot reliably predict which shard gets to write the first line of the output.
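As for attaching a parameter to a specific job: the appengine-mapreduce pipeline API lets you pass per-job mapper_params/reducer_params, and you can chain the header step so it runs only after all shards are done. A minimal sketch, assuming the standard MapreducePipeline; the map/reduce specs and AddHeader are placeholders for your own handlers:

from mapreduce import base_handler, mapreduce_pipeline

class CsvJobPipeline(base_handler.PipelineBase):
    def run(self, blobkey):
        # mapper_params/reducer_params are attached to this job only;
        # handlers can read them via mapreduce.context.get().
        output = yield mapreduce_pipeline.MapreducePipeline(
            "csv_job",
            "main.map_fn",     # placeholder mapper spec
            "main.reduce_fn",  # placeholder reducer spec
            "mapreduce.input_readers.BlobstoreLineInputReader",
            "mapreduce.output_writers.BlobstoreOutputWriter",
            mapper_params={"blob_key": blobkey},
            reducer_params={"mime_type": "text/csv"},
            shards=4)
        # Yielded after MapreducePipeline, so it runs once every shard
        # has finished -- prepend the header here.
        yield AddHeader(output)  # AddHeader: a PipelineBase wrapping the copy code above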
