Should I flatten HL7 data to use it in Hadoop / Hive, or extend Hive?

Question

I am working with a large volume of HL7 messages in the 2.x format. The format is pipe-delimited, and each message looks roughly like this (dummy data):

MSH|^~\&|EPIC|EPICADT|SMS|SMSADT|199912271408|CHARRIS|ADT^A04|1817457|D|2.5|
PID||0493575^^^2^ID 1|454721||DOE^JOHN^^^^|DOE^JOHN^^^^|19480203|M||B|254 MYSTREET AVE^^MYTOWN^OH^44123^USA||(216)123-4567|||M|NON|400003403~1129086|
NK1||ROE^MARIE^^^^|SPO||(216)123-4567||EC|||||||||||||||||||||||||||
PV1||O|168 ~219~C~PMA^^^^^^^^^||||277^ALLEN MYLASTNAME^BONNIE^^^^|||||||||| ||2688684|||||||||||||||||||||||||199912271408||||||002376853

I would like to run large queries on this data and explore it using Hive or something similar. Should I first flatten this data into more of a table format using HParser or something similar? Or would it be worth the time to extend Hive so it can query the data directly via a custom SerDe or InputFormat?
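To make the "flatten first" option concrete, here is a minimal Python sketch of what flattening one message into a single table-like row could look like. It is only an illustration, not HParser's behavior: the function name `flatten_message` and the `SEG.n` column naming are hypothetical, the pipe separator is assumed to be fixed, components inside a field (`^`) are left unsplit, and a repeated segment would overwrite the earlier one.

```python
from typing import Dict


def flatten_message(message: str) -> Dict[str, str]:
    """Flatten one HL7 v2.x message into a single {"SEG.n": value} row."""
    row: Dict[str, str] = {}
    for line in message.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")  # assumption: "|" is the field separator
        segment = fields[0]
        # In MSH the separator itself counts as MSH-1, so the value after
        # the first pipe is MSH-2. Other segments start at field 1.
        offset = 1 if segment == "MSH" else 0
        for i, value in enumerate(fields[1:], start=1):
            if value:  # skip empty fields to keep the row sparse
                row[f"{segment}.{i + offset}"] = value
    return row


# Abridged from the dummy message above.
SAMPLE = r"""MSH|^~\&|EPIC|EPICADT|SMS|SMSADT|199912271408|CHARRIS|ADT^A04|1817457|D|2.5|
PID||0493575^^^2^ID 1|454721||DOE^JOHN^^^^|DOE^JOHN^^^^|19480203|M|"""

row = flatten_message(SAMPLE)
for column in sorted(row):
    print(column, "=", row[column])
```

A row shaped like this could be written out as delimited text and loaded into an ordinary Hive table, at the cost of an extra processing pass over the raw messages.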

Answer 1

You should be able to process HL7 with a regex via the RegexSerDe relatively easily. That said, writing a SerDe isn't terribly difficult (a couple of hours) once you grok the ObjectInspector and the other Hive plumbing. A custom SerDe can also supply the field names automatically, but that's a minor benefit. A separate parsing step is overkill.
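As a rough illustration of the regex idea: a regex-based SerDe maps each capture group of a pattern to one table column, in order. The Python sketch below mimics that mapping for a PID segment line. It is not Hive configuration; the pattern, the choice of fields, and the column names (`pid_2`, `pid_3`, ...) are all hypothetical, and it assumes `|` is the field separator.

```python
import re
from typing import Dict, Optional

# One capture group per column we want; uncaptured fields are skipped over.
PID_PATTERN = re.compile(
    r"^PID\|"
    r"[^|]*\|"     # PID-1, not captured
    r"([^|]*)\|"   # PID-2 -> group 1
    r"([^|]*)\|"   # PID-3 -> group 2
    r"[^|]*\|"     # PID-4, not captured
    r"([^|]*)\|"   # PID-5 -> group 3
    r"[^|]*\|"     # PID-6, not captured
    r"([^|]*)\|"   # PID-7 -> group 4
    r"([^|]*)"     # PID-8 -> group 5
)

# Hypothetical column names, one per capture group.
COLUMNS = ["pid_2", "pid_3", "pid_5", "pid_7", "pid_8"]


def parse_pid(line: str) -> Optional[Dict[str, str]]:
    """Return a column dict for a PID line, or None if the line is not a match."""
    match = PID_PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# First fields of the PID line from the question.
pid_line = "PID||0493575^^^2^ID 1|454721||DOE^JOHN^^^^|DOE^JOHN^^^^|19480203|M||B|"
parsed = parse_pid(pid_line)
print(parsed)
```

Note that with a plain line-oriented input format each segment line is its own record, so a single pattern like this covers one segment type at a time.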

Answer 2

You can write a custom InputFormat and RecordReader using the Hadoop API. See this article to get started: http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/
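The reason a custom RecordReader helps here is that one HL7 message spans several lines, while a line-oriented reader hands over one line per record. The core grouping logic can be sketched in Python as below. This is only an illustration of the idea, not Hadoop code: the function name `read_messages` is hypothetical, it assumes every message begins with an `MSH|` segment, and it ignores the input-split boundaries that a real RecordReader has to handle.

```python
from typing import Iterable, Iterator, List


def read_messages(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group segment lines into messages, starting a new one at each MSH line."""
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines between messages
        if line.startswith("MSH|") and current:
            yield current  # the previous message is complete
            current = []
        current.append(line)
    if current:
        yield current  # emit the final message


# Two abridged dummy messages, one segment per line.
sample_lines = [
    r"MSH|^~\&|EPIC|EPICADT|SMS|SMSADT|199912271408|CHARRIS|ADT^A04|1817457|D|2.5|",
    "PID||0493575^^^2^ID 1|454721||DOE^JOHN^^^^|",
    "NK1||ROE^MARIE^^^^|SPO||",
    "",
    r"MSH|^~\&|EPIC|EPICADT|SMS|SMSADT|199912271409|CHARRIS|ADT^A04|1817458|D|2.5|",
    "PID||0493576^^^2^ID 1|454722||DOE^JANE^^^^|",
]

messages = list(read_messages(sample_lines))
for number, segments in enumerate(messages, start=1):
    print(number, [segment.split("|", 1)[0] for segment in segments])
```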

Answer 1: 4 votes, accepted, 2012-11-28 00:37:49
Answer 2: 1 vote, 2013-01-08 16:49:43