
Replace text in input file with Hadoop MR

I am a newbie on the MR and Hadoop front. I wrote an MR job for finding missing values in a CSV file and it is working fine. Now I have a use case where I need to parse a CSV file and recode its values according to the corresponding category.

For example: "11,abc,xyz,51,61,78", "11,adc,ryz,41,71,38", .............

This has to be replaced with "1,abc,xyz,5,6,7", "1,adc,ryz,4,7,3", .............

Here I am doing a mod of 10, but there will be other cases with different mods. The data size is in GBs.

I want to know how to replace the content of the input in place. Is this achievable with MR?

Basically, I have not seen any Hadoop examples anywhere that deal with file handling or writing.

At this point I do not want to move to HBase or other database tools.

You cannot replace data in place, since HDFS files are append-only and cannot be edited.
I think the simplest way to achieve your goal is to register your data in Hive as an external table and write your transformation in HQL.
Hive is a system that sits alongside Hadoop and translates your queries into MR jobs. Using it is not as serious an infrastructure decision as using HBase.
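
If you would rather stay with plain MR, the same idea applies: since the input cannot be edited, a map-only job can read each record, recode it, and write the result to a new output directory. Below is a minimal sketch of such a mapper for Hadoop Streaming, written in Python. It is only an illustration under a few assumptions: one record per line, fields separated by plain commas with no quoting, and only fields that are non-negative integers get recoded. The names `recode_field`, `recode_line`, `run_mapper` and `DIVISOR` are made up for this example. Note that the sample output in the question (51 becomes 5, 78 becomes 7) matches integer division by 10 rather than the remainder, so the sketch uses `//`; swap in `%` if the remainder is what you need.

```python
import sys

# Assumption: the divisor comes from the question's example; other
# categories would need their own value or rule.
DIVISOR = 10


def recode_field(field, divisor=DIVISOR):
    """Return the field integer-divided by `divisor` if it is a
    non-negative integer; otherwise return the field unchanged."""
    stripped = field.strip()
    if stripped.isdecimal():
        return str(int(stripped) // divisor)
    return field


def recode_line(line, divisor=DIVISOR):
    """Recode every numeric field of one comma-separated record."""
    return ",".join(recode_field(f, divisor) for f in line.split(","))


def run_mapper(lines, out, divisor=DIVISOR):
    """Streaming-style mapper loop: one input record in, one recoded
    record out. No key is emitted because no reducer is needed."""
    for line in lines:
        out.write(recode_line(line.rstrip("\n"), divisor) + "\n")


# In a streaming job, the script's entry point would simply call
# run_mapper(sys.stdin, sys.stdout).
```

Run as a Hadoop Streaming mapper with the number of reducers set to zero (`mapreduce.job.reduces=0`) and `-output` pointing at a new directory. Once the result has been checked, the original directory can be deleted and the new one moved into its place, which is as close to "in place" as HDFS allows.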


 