
Converting JSON into a sequence file for Hadoop

I have a JSON file (2-3 GB in size) stored in HDFS. My files look like this:

{ "DateTime" : 24-08-2015T00:00:00, "Cost":53.09,"UID":9,"Channel":"some Channel"}
{ "DateTime" : 25-08-2015T00:00:00, "Cost":54.09,"UID":8,"Channel":"some Channel2"}
{ "DateTime" : 24-08-2015T00:00:00, "Cost":56.09,"UID":7,"Channel":"some Channel3"}

I am trying to write a MapReduce job to convert these JSON files into sequence files and then read the JSON objects back. I need fast execution, and parsing with Gson to convert each record into a Java object takes time. I googled this and found that JAQL can do the same thing, but I didn't find any Java MR code for it; I couldn't even find Maven jars for JAQL, and I can't install it explicitly on my server. Is there any way to achieve this using Java code?
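
For reference, here is a minimal map-only sketch of such a conversion written against the plain Hadoop MapReduce API, with no JAQL involved. The class name, argument layout, and compression settings below are my own assumptions, not anything from the question; the job simply copies each JSON line unchanged into a block-compressed SequenceFile as a Text value, so no Gson parsing happens during the conversion at all:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class JsonToSequenceFile {

    // Map-only pass-through: each input line (one JSON record) becomes the
    // value of a SequenceFile record, written verbatim without parsing.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text jsonLine, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(NullWritable.get(), jsonLine);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "json-to-sequencefile");
        job.setJarByClass(JsonToSequenceFile.class);

        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // no shuffle needed for a straight conversion

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        // Block compression usually pays off for repetitive JSON text.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS path of the JSON input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A downstream job can then read the output with SequenceFileInputFormat and parse each Text value with Gson only when and where the object's fields are actually needed.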

I'd offer Tika.

Description of this project: Apache Tika integration with Jaql, using MapReduce for Hadoop.

This project helps overcome the inefficiency of processing many small files in Hadoop using Jaql. Moreover, it allows binary documents to be processed and analyzed in Hadoop with Apache Tika by integrating it into Jaql, which in turn spawns a MapReduce job. Please check the samples.
