Replace text in input file with Hadoop MR
I am a newbie on the MR and Hadoop front. I wrote an MR job for finding missing values in a CSV file, and it is working fine.
Now I have a use case where I need to parse a CSV file and recode its values according to the relevant category.
Example: "11,abc,xyz,51,61,78","11,adc,ryz,41,71,38",.............
This now has to be replaced with "1,abc,xyz,5,6,7","1,adc,ryz,4,7,3",.............
Here I am doing a mod of 10, but there will be different cases of mods. The data size is in GBs.
I want to know how to replace the content of the input in place. Is this achievable with MR?
Basically, I have not seen any Hadoop examples based on file handling or writing anywhere.
At this point I do not want to go to HBase or other DB tools.
You cannot replace data in place, since HDFS files are append-only and cannot be edited.
I think the simplest way to achieve your goal is to register your data in Hive as an external table and write your transformation in HQL.
Hive is a system that sits alongside Hadoop and translates your queries into MR jobs. Using it is not as serious an infrastructure decision as using HBase.
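If you would rather stay with plain MR, the same idea applies: read the input, write a transformed copy to a new output directory, and swap the directories afterwards if you need to. Below is a minimal sketch of the per-line transformation, written as a mapper for Hadoop Streaming. It is an illustration, not a tested job: the `DIVISORS` table, the column indices, and the function names are hypothetical, and it assumes plain comma-separated lines with no quoted or embedded commas. Note that the sample output in the question (11 becomes 1, 78 becomes 7) matches integer division by 10 rather than a remainder, so the sketch uses `//`; swap in `%` if a true modulo is what you need.

```python
import sys

# Hypothetical configuration: column index -> divisor applied to that column.
# Columns that are not listed are passed through unchanged.
DIVISORS = {0: 10, 3: 10, 4: 10, 5: 10}


def recode_line(line, divisors=DIVISORS):
    """Return one CSV line with the configured numeric columns recoded."""
    fields = line.rstrip("\n").split(",")
    for idx, divisor in divisors.items():
        if idx < len(fields):
            try:
                # Integer division, matching the sample in the question.
                fields[idx] = str(int(fields[idx]) // divisor)
            except ValueError:
                # Leave non-numeric values untouched.
                pass
    return ",".join(fields)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Streaming-style mapper loop: one input line in, one recoded line out."""
    for line in stdin:
        stdout.write(recode_line(line) + "\n")
```

To use it as a script, call `main()` at the bottom of the file and try it locally first by piping a small sample file through it. On the cluster it would run as a map-only job (zero reducers, for example by setting `mapreduce.job.reduces=0`) with an output path different from the input path. Check the job output for stray key/value separators, since Streaming treats each emitted line as key/value text.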