
Should I flatten HL7 data to work with it in Hadoop/Hive? Or extend Hive?

I am working with a large volume of HL7 messages in the 2.x format. The format is pipe-delimited, and each message looks roughly like this (dummy data):

MSH|^~\&|EPIC|EPICADT|SMS|SMSADT|199912271408|CHARRIS|ADT^A04|1817457|D|2.5|
PID||0493575^^^2^ID 1|454721||DOE^JOHN^^^^|DOE^JOHN^^^^|19480203|M||B|254 MYSTREET AVE^^MYTOWN^OH^44123^USA||(216)123-4567|||M|NON|400003403~1129086|
NK1||ROE^MARIE^^^^|SPO||(216)123-4567||EC|||||||||||||||||||||||||||
PV1||O|168 ~219~C~PMA^^^^^^^^^||||277^ALLEN MYLASTNAME^BONNIE^^^^|||||||||| ||2688684|||||||||||||||||||||||||199912271408||||||002376853

I would like to do large queries / exploration of this data using Hive or something similar. Should I first flatten this data into a more tabular format using HParser or something similar? Or would it be worth the time to extend Hive to query this directly via a custom SerDe or InputFormat?

You should be able to process HL7 with a regex via Hive's RegexSerDe relatively easily. That said, writing a custom SerDe isn't terribly difficult (a couple of hours) once you grok the ObjectInspector and the rest of the Hive plumbing. A custom SerDe can also supply the field names automatically, but that's a minor benefit. A separate parsing step is overkill.
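
For example, a minimal sketch of such a table, assuming one HL7 segment per line and the raw files already in HDFS under /data/hl7 (the path, table name, and column names here are illustrative, not from the question):

-- Each capturing group in input.regex maps to one column, in order;
-- lines that don't fully match the pattern (MSH, NK1, PV1, ...) come
-- back as all NULLs, so the PID filter below drops them.
CREATE EXTERNAL TABLE hl7_pid (
  segment      STRING,  -- the literal "PID"
  set_id       STRING,  -- PID-1
  external_id  STRING,  -- PID-2
  internal_id  STRING,  -- PID-3
  alternate_id STRING,  -- PID-4
  patient_name STRING,  -- PID-5, still ^-delimited components
  remainder    STRING   -- everything after PID-5
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(PID)\\|([^|]*)\\|([^|]*)\\|([^|]*)\\|([^|]*)\\|([^|]*)\\|(.*)"
)
STORED AS TEXTFILE
LOCATION '/data/hl7';

-- The ^-delimited component fields can then be split at query time:
SELECT split(patient_name, '\\^')[0] AS family_name,
       split(patient_name, '\\^')[1] AS given_name
FROM hl7_pid
WHERE segment = 'PID';

Against the sample data above, the query would yield family_name = DOE and given_name = JOHN.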

You can write a custom InputFormat and RecordReader using the Hadoop API. See this article to get started: http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/
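
To connect that approach back to Hive: once such an InputFormat is on the classpath, it can be wired into a table definition directly. A sketch, where Hl7InputFormat and the jar path are hypothetical (e.g. an InputFormat written to deliver one whole multi-segment message per record instead of one line):

-- illustrative jar path; the jar must contain the custom InputFormat
ADD JAR /path/to/hl7-inputformat.jar;

CREATE EXTERNAL TABLE hl7_messages (message STRING)
STORED AS
  INPUTFORMAT  'com.example.hl7.Hl7InputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/hl7';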
