
Replace text in an input file with Hadoop MR

Original post: 2012-04-24 07:26:06 | Tags: hadoop, mapreduce

I am a newbie on the MR and Hadoop front. I wrote an MR job that finds missing values in a CSV file, and it is working fine. Now I have a use case where I need to parse a CSV file and recode its values according to the corresponding category.

Example: "11,abc,xyz,51,61,78","11,adc,ryz,41,71,38",.............

Now this has to be replaced with "1,abc,xyz,5,6,7","1,adc,ryz,4,7,3",.............

Here I am doing a mod of 10, but there will be different mod cases. The data size is in GBs.
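The coding rule can be sketched as a small per-record function. Note that the sample output above (51 becomes 5, 78 becomes 7) matches integer division by 10 rather than a remainder, so this sketch takes the operation as a parameter to cover the "different cases". The names `recode_line` and `tens` are hypothetical, and it assumes each record is one line of comma-separated fields with no embedded commas or quoting.

```python
from typing import Callable


def recode_line(line: str, op: Callable[[int], int]) -> str:
    """Apply op to every all-digit field of a comma-separated record.

    Non-numeric fields (e.g. "abc") are passed through unchanged.
    """
    fields = line.split(",")
    return ",".join(str(op(int(f))) if f.isdecimal() else f for f in fields)


def tens(n: int) -> int:
    """Integer division by 10, which reproduces the example in the question."""
    return n // 10


print(recode_line("11,abc,xyz,51,61,78", tens))  # 1,abc,xyz,5,6,7
```

A different case is just a different function, for example `recode_line(line, lambda n: n % 10)` for a true remainder.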

I want to know how to replace the content of the input in place. Is this achievable with MR?

Basically, I have not seen any Hadoop examples based on file handling or writing anywhere.

At this point I do not want to go to HBase or other DB tools.

1 Answer

You cannot replace data in place, since HDFS files are append-only and cannot be edited.
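If plain MR is preferred, the usual workaround is to write the transformed records to a new output directory and then swap it for the original. A minimal sketch of a map-only mapper, assuming Hadoop Streaming (which feeds input lines on standard input and collects standard output) and assuming the integer-division rule from the question's example; `recode` and `run` are hypothetical names:

```python
import io


def recode(line: str) -> str:
    """Replace each all-digit field with its value divided by 10 (integer division)."""
    fields = line.rstrip("\n").split(",")
    return ",".join(str(int(f) // 10) if f.isdecimal() else f for f in fields)


def run(stdin, stdout) -> None:
    """Mapper loop: one transformed record out per record in."""
    for line in stdin:
        stdout.write(recode(line) + "\n")


# Local demonstration with in-memory streams. Under Hadoop Streaming the
# script would instead call run(sys.stdin, sys.stdout).
source = io.StringIO("11,abc,xyz,51,61,78\n11,adc,ryz,41,71,38\n")
sink = io.StringIO()
run(source, sink)
print(sink.getvalue(), end="")
```

Run with zero reducers, the job writes the mapper output directly as files in the new output directory; the old input directory can then be removed and the new one renamed to take its place.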
I think the simplest way to achieve your goal is to register your data in Hive as an external table and write your transformation in HQL.
Hive is a system that sits alongside Hadoop and translates your queries into MR jobs. Using it is not as serious an infrastructure decision as using HBase.