
Should I flatten HL7 data to work with it in Hadoop/Hive? Or extend Hive?

I am working with a large volume of HL7 messages in the 2.x format. The format is pipe-delimited, and each message looks roughly like this (dummy data):

MSH|^~\&|EPIC|EPICADT|SMS|SMSADT|199912271408|CHARRIS|ADT^A04|1817457|D|2.5|
PID||0493575^^^2^ID 1|454721||DOE^JOHN^^^^|DOE^JOHN^^^^|19480203|M||B|254 MYSTREET AVE^^MYTOWN^OH^44123^USA||(216)123-4567|||M|NON|400003403~1129086|
NK1||ROE^MARIE^^^^|SPO||(216)123-4567||EC|||||||||||||||||||||||||||
PV1||O|168 ~219~C~PMA^^^^^^^^^||||277^ALLEN MYLASTNAME^BONNIE^^^^|||||||||| ||2688684|||||||||||||||||||||||||199912271408||||||002376853

I would like to do large queries / exploration of this data using Hive or something similar. Should I first flatten this data into a more tabular format using HParser or something similar? Or would it be worth the time to extend Hive to query this directly via a custom SerDe or InputFormat?

You should be able to process HL7 with a regex via Hive's RegexSerDe relatively easily. That said, writing a custom SerDe isn't terribly difficult (a couple of hours) once you grok the ObjectInspector and the rest of the Hive plumbing. A custom SerDe can also supply the field names automatically, but that's a minor benefit. A separate parsing step is overkill.
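
For example, a minimal sketch of such a table, assuming one HL7 segment per line and the raw files already in HDFS under /data/hl7 (the path, table name, and column names here are illustrative, not from the question):

-- Each capturing group in input.regex maps to one column, in order;
-- lines that don't fully match the pattern (MSH, NK1, PV1, ...) come
-- back as all NULLs, so the PID filter below drops them.
CREATE EXTERNAL TABLE hl7_pid (
  segment      STRING,  -- the literal "PID"
  set_id       STRING,  -- PID-1
  external_id  STRING,  -- PID-2
  internal_id  STRING,  -- PID-3
  alternate_id STRING,  -- PID-4
  patient_name STRING,  -- PID-5, still ^-delimited components
  remainder    STRING   -- everything after PID-5
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(PID)\\|([^|]*)\\|([^|]*)\\|([^|]*)\\|([^|]*)\\|([^|]*)\\|(.*)"
)
STORED AS TEXTFILE
LOCATION '/data/hl7';

-- The ^-delimited component fields can then be split at query time:
SELECT split(patient_name, '\\^')[0] AS family_name,
       split(patient_name, '\\^')[1] AS given_name
FROM hl7_pid
WHERE segment = 'PID';

Against the sample data above, the query would yield family_name = DOE and given_name = JOHN.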

You can write a custom InputFormat and RecordReader using the Hadoop API. See this article to get started: http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/
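
To connect that approach back to Hive: once such an InputFormat is on the classpath, it can be wired into a table definition directly. A sketch, where Hl7InputFormat and the jar path are hypothetical (e.g. an InputFormat written to deliver one whole multi-segment message per record instead of one line):

-- illustrative jar path; the jar must contain the custom InputFormat
ADD JAR /path/to/hl7-inputformat.jar;

CREATE EXTERNAL TABLE hl7_messages (message STRING)
STORED AS
  INPUTFORMAT  'com.example.hl7.Hl7InputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/hl7';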
