
How do I convert EBCDIC to TEXT using Hadoop Mapreduce

I need to parse an EBCDIC input file. Using Java, I can read it as follows:

InputStreamReader rdr = new InputStreamReader(
    new FileInputStream("/Users/rr/Documents/workspace/EBCDIC_TO_ASCII/ebcdic.txt"),
    java.nio.charset.Charset.forName("ibm500"));

But in Hadoop MapReduce, I need to parse it via a RecordReader, which has not worked so far.

Can anyone provide a solution to this problem?
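
One way to do this inside MapReduce without writing a custom RecordReader is Hadoop's built-in FixedLengthInputFormat, which hands each record to the mapper as raw bytes. Below is a minimal sketch, assuming the input is plain EBCDIC character data in fixed-length records; the 80-byte record length is a placeholder for your actual layout:

import java.io.IOException;
import java.nio.charset.Charset;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EbcdicToTextJob {

    // Receives each fixed-length record as raw bytes and decodes it from
    // EBCDIC (code page ibm500) into a text line.
    public static class DecodeMapper
            extends Mapper<LongWritable, BytesWritable, LongWritable, Text> {
        private static final Charset EBCDIC = Charset.forName("ibm500");
        private final Text out = new Text();

        @Override
        protected void map(LongWritable key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            out.set(new String(value.copyBytes(), EBCDIC));
            context.write(key, out);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder: set this to the real record length of your file.
        FixedLengthInputFormat.setRecordLength(conf, 80);

        Job job = Job.getInstance(conf, "ebcdic-to-text");
        job.setJarByClass(EbcdicToTextJob.class);
        job.setInputFormatClass(FixedLengthInputFormat.class);
        job.setMapperClass(DecodeMapper.class);
        job.setNumReduceTasks(0); // map-only conversion job
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note this only works if the records are pure character data; binary or packed-decimal fields need field-level decoding, which the answers below address.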

The best option is to convert the data to ASCII first, and then load it into HDFS.
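
A minimal sketch of that pre-conversion step, assuming the file is plain EBCDIC text whose record delimiters decode to newlines (input and output paths are placeholders taken from the command line):

import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EbcdicToAscii {
    public static void main(String[] args) throws IOException {
        // Read through an ibm500 decoder, write back out as UTF-8 text.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                 new FileInputStream(args[0]), Charset.forName("ibm500")));
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                 new FileOutputStream(args[1]), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.newLine();
            }
        }
    }
}

The converted file can then be copied into HDFS, for example with hdfs dfs -put.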

Why is the file in EBCDIC, and does it need to be?

If it is just text data, why not convert it to ASCII when you send / pull the file from the Mainframe / AS400?

If the file contains binary or Cobol numeric fields, then you have several options (a packed-decimal example follows this list):

  1. Convert the file to normal text on the mainframe (the mainframe sort utility is good at this), then send the file and convert it (to ASCII).
  2. If it is a Cobol file, there are some open source projects you could look at: https://github.com/tmalaska/CopybookInputFormat or https://github.com/ianbuss/CopybookHadoop
  3. There are commercial packages for loading mainframe Cobol data into Hadoop.
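
To see why options like these are needed, here is the packed-decimal example promised above: a COMP-3 field stores two digits per byte with a sign in the final nibble, so running it through an ibm500 character conversion would destroy it. A minimal hand-rolled decoder, for illustration only (the copybook projects above handle this for you):

import java.math.BigDecimal;

public class PackedDecimal {
    /** Decodes a COMP-3 (packed decimal) field into a BigDecimal. */
    public static BigDecimal unpack(byte[] field, int scale) {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < field.length; i++) {
            digits.append((field[i] >> 4) & 0x0F);   // high nibble: a digit
            if (i < field.length - 1) {
                digits.append(field[i] & 0x0F);      // low nibble: a digit
            }
        }
        int sign = field[field.length - 1] & 0x0F;   // last nibble: the sign
        BigDecimal value = new BigDecimal(digits.toString()).movePointLeft(scale);
        return (sign == 0x0D) ? value.negate() : value; // 0xD means negative
    }

    public static void main(String[] args) {
        // 0x12 0x34 0x5C packs the digits 12345 with a positive sign (0xC);
        // with scale 2 that is 123.45.
        byte[] comp3 = { 0x12, 0x34, 0x5C };
        System.out.println(unpack(comp3, 2)); // prints 123.45
    }
}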

You can try parsing it with Spark, perhaps using Cobrix, which is an open-source COBOL data source for Spark.
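
A sketch of that approach in Java, following Cobrix's documented usage pattern; it assumes the Cobrix spark-cobol artifact is on the classpath, and the copybook and data paths are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CobrixExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("ebcdic-with-cobrix")
            .getOrCreate();

        // Cobrix parses the EBCDIC records according to the COBOL copybook,
        // so binary and packed-decimal fields come out as typed columns.
        Dataset<Row> df = spark.read()
            .format("cobol")                                  // Cobrix data source
            .option("copybook", "/path/to/record_layout.cpy") // placeholder path
            .load("/path/to/ebcdic_data");                    // placeholder path

        df.show();
    }
}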
