
hadoop - what is the best way to fetch data from a very big sequence file?

I have a very big Hadoop sequence file in HDFS. What is the best way to fetch data from it, i.e., select specific records, etc.?

Can it be done with Hive? How can I create a Hive table from a sequence file?

Thanks

If you need 'quick' access to the data, you should consider loading it into a datastore of some sort (a relational DB, or a NoSQL store such as HBase or Accumulo).

Another option (if you can rewrite your data) is to use a MapFile: this builds an index over the keys of your sequence file and gives much quicker keyed access to the data than a full file scan.
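A minimal sketch of that MapFile approach, assuming the Hadoop 2.x client libraries are on the classpath; the path `/data/mymapfile` and the key/value contents are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path dir = new Path("/data/mymapfile"); // hypothetical location

        // Rewrite keyed records into a MapFile; note that MapFile
        // requires keys to be appended in sorted order.
        try (MapFile.Writer writer = new MapFile.Writer(conf, dir,
                MapFile.Writer.keyClass(Text.class),
                MapFile.Writer.valueClass(Text.class))) {
            writer.append(new Text("key1"), new Text("value1"));
            writer.append(new Text("key2"), new Text("value2"));
        }

        // Keyed lookup: the reader binary-searches the index file and
        // seeks into the data file, instead of scanning it end to end.
        try (MapFile.Reader reader = new MapFile.Reader(dir, conf)) {
            Text value = new Text();
            if (reader.get(new Text("key2"), value) != null) {
                System.out.println(value);
            }
        }
    }
}
```

The trade-off is that the data must be rewritten and kept sorted by key, but a lookup then touches only the index plus one seek rather than the whole file.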

Otherwise, if you want to use Hive, there is a thread on the Hive mailing list about this exact subject.
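For the Hive route, a sketch of an external table declared over an existing sequence file directory; the column names/types and the HDFS path are hypothetical and must match how your records were serialized (with the default SerDe, Hive reads each sequence file *value* as a delimited text row and ignores the key):

```sql
-- Point an external table at the directory holding the sequence file(s)
CREATE EXTERNAL TABLE my_seq_table (
  id    STRING,
  value STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION '/user/me/myseqdata';

-- Then select records like any other table
SELECT * FROM my_seq_table WHERE id = 'key1' LIMIT 10;
```

Keep in mind that without partitioning or an index, such a query still scans the full file under the hood; Hive mainly adds the SQL-style selection on top.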
