
What are Hive Common Use Cases?

I'm new to Hive, so I'm not sure how companies use it. Let me give you a scenario and see if I'm conceptually correct about the use of Hive.

Let's say my company wants to keep some web server log files and be able to always search through and analyze the logs. So, I create a table whose columns correspond to the columns in the log file. Then I load the log file into the table. Now, I can start querying the data. As data comes in at future dates, I just keep adding it to this table, and thus I always have my log files as a table in Hive that I can search through and analyze.
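For concreteness, a minimal sketch of that approach in HiveQL (the table name, columns, delimiter, and file path here are made up for illustration):

    -- Hypothetical table matching the columns of a tab-delimited access log
    CREATE TABLE web_logs (
      ip     STRING,
      ts     STRING,
      method STRING,
      url    STRING,
      status INT,
      bytes  BIGINT
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t';

    -- Load one day's file; LOAD DATA moves the file into the table's directory
    LOAD DATA INPATH '/tmp/access-2019-01-01.log' INTO TABLE web_logs;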

Is that scenario above a common use? And if it is, then how do I keep adding new log files to the table? Do I have to keep adding them to the table manually each day?

You can use Hive for analysis over static datasets, but if you have streaming logs, I really wouldn't suggest Hive for this. It's not a search engine, and it will take minutes just to find any reasonable data you're looking for.

HBase would probably be a better alternative if you must stay within the Hadoop ecosystem. (Hive can query HBase.)
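For reference, Hive maps onto an existing HBase table through the HBase storage handler. A sketch, where the HBase table name, column family, and qualifier are assumptions:

    -- Map a Hive table onto an existing HBase table named 'logs'
    -- (column family 'd' and qualifier 'message' are assumptions)
    CREATE EXTERNAL TABLE hbase_logs (
      rowkey  STRING,
      message STRING
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:message')
    TBLPROPERTIES ('hbase.table.name' = 'logs');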

Use Splunk, or the open source alternatives Solr / Elasticsearch / Graylog, if you want reasonable tools for log analysis.

But to answer your questions:

how do I keep adding new log files to the table? Do I have to keep adding them to the table manually each day?

Use an EXTERNAL Hive table over an HDFS location for your logs. Use Flume to send log data to that path (or send your logs to Kafka, and from Kafka to HDFS, as well as to a search/analytics system).
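A sketch of such an external, date-partitioned table (the HDFS path and schema are assumptions):

    -- External, date-partitioned table over the directory the log shipper
    -- writes to; dropping the table leaves the files on HDFS untouched
    CREATE EXTERNAL TABLE access_logs (
      ip     STRING,
      url    STRING,
      status INT
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LOCATION '/data/logs/access';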

You only need to update the table if you're adding date partitions (which you should, because that's how you get faster Hive queries). You'd use MSCK REPAIR TABLE to detect missing partitions on HDFS, or run ALTER TABLE ... ADD PARTITION yourself on a schedule. Note: Confluent's HDFS Kafka Connect connector will automatically create Hive table partitions for you.
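Both options look like this, using the hypothetical table and path from the sketch above:

    -- Scan HDFS under the table's location and register any missing partitions
    MSCK REPAIR TABLE access_logs;

    -- Or register one day's partition explicitly, e.g. from a daily cron job
    ALTER TABLE access_logs ADD IF NOT EXISTS
      PARTITION (dt = '2019-01-01')
      LOCATION '/data/logs/access/dt=2019-01-01';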

If you must use Hive, you can improve query performance by converting the data into ORC or Parquet format.
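A simple way to do the conversion is a CTAS statement, again using the hypothetical table above:

    -- One-off conversion of the raw text table into columnar ORC
    CREATE TABLE access_logs_orc
    STORED AS ORC
    AS SELECT * FROM access_logs;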
