
Hadoop Streaming Multiline Input

I'm using Dumbo for some Hadoop Streaming jobs. I have a bunch of JSON dictionaries, each containing an article (multiline text) and some metadata. I know Hadoop performs best when given large files, so I want to concatenate all the JSON dictionaries into a single file.

The problem is that I don't know how to make Hadoop read each dictionary/article as a separate value instead of splitting on newlines. How can I tell Hadoop to use a custom record separator? Or maybe I can put all of the JSON dictionaries into a list data structure and have Hadoop read that in?

Or maybe encoding the string (base64?) would remove all of the newlines, so the normal "reader" would be able to handle it?
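For concreteness, the base64 idea would look something like this (a minimal sketch; the sample record is made up for illustration):

```python
import base64
import json

# Serialize each dictionary and base64-encode it so the payload is
# guaranteed newline-free, then write one encoded record per line.
# (Note: json.dumps already escapes embedded newlines as \n, so this
# extra step is usually unnecessary; see the answers below.)
record = {"title": "Example", "body": "line one\nline two"}
encoded = base64.b64encode(json.dumps(record).encode("utf-8"))

# In the mapper, reverse the encoding to recover the dictionary.
decoded = json.loads(base64.b64decode(encoded).decode("utf-8"))
assert decoded == record
```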

You can just replace all newlines with spaces in each dictionary when concatenating your JSON files. A newline has no special meaning in JSON beyond being a whitespace character.
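A minimal Python sketch of this approach (the articles/*.json layout and the output file name are assumptions for illustration):

```python
import glob
import json

# Merge many single-object JSON files into one file with one JSON
# dictionary per line. json.dumps escapes newlines inside string values
# as \n, so each serialized object occupies exactly one line and Hadoop
# Streaming's default line-based splitting yields one record per article.
with open("articles.jsonl", "w") as out:
    for path in glob.glob("articles/*.json"):
        with open(path) as f:
            obj = json.load(f)
        out.write(json.dumps(obj) + "\n")
```

A Dumbo mapper can then recover each dictionary with json.loads(value), since each line arrives as a single value.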

concatenated-json-mapreduce is a custom input format and record reader that splits the stream into JSON objects by pushing and popping on the opening and closing braces.

It was written to handle streaming JSON (rather than newline-separated JSON), so as long as the input consists of well-formed JSON objects that use \n instead of actual newlines, it should work.
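The brace-tracking idea behind such a record reader can be sketched in plain Python (a simplified illustration, not the project's actual code; it handles braces inside strings with a basic in-string/escape flag and ignores input-split boundaries):

```python
import json

def split_concatenated_json(text):
    """Split a string of back-to-back JSON objects by tracking brace
    depth: push on '{', pop on '}', and cut a record each time the
    depth returns to zero."""
    objects, depth, start = [], 0, None
    in_string = escape = False
    for i, ch in enumerate(text):
        if in_string:
            # Inside a string literal, braces have no structural meaning.
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                objects.append(json.loads(text[start:i + 1]))
    return objects

print(split_concatenated_json('{"a": 1}{"b": {"c": "}"}}'))
# -> [{'a': 1}, {'b': {'c': '}'}}]
```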
