
hive - how to automatically append data to hive table every day?

I have a directory in HDFS where .csv files with a fixed structure and column names are dumped at the end of every day.
I have a Hive table that should have new data appended to it at the beginning of every day, using the data from the previous day's .csv file. How do I accomplish this?

I can suggest using cron jobs. You create a script that updates the table, and you configure a cron job to execute that script at a specific time of day (in your case, the beginning of the day); the table will then be updated automatically.
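As a minimal sketch of that setup, the script below loads the previous day's file into the table; the paths, file-naming scheme, and table name (`daily_data`) are assumptions, not something given in the question:

```shell
#!/bin/sh
# load_daily.sh - hypothetical script: append yesterday's CSV into the Hive table.
# Assumes files are named /data/incoming/YYYY-MM-DD.csv and the table already exists.
YESTERDAY=$(date -d "yesterday" +%Y-%m-%d)
hive -e "LOAD DATA INPATH '/data/incoming/${YESTERDAY}.csv' INTO TABLE daily_data;"
```

The matching crontab entry would run it shortly after midnight every day:

```shell
# crontab entry: run the load script at 00:05 each day and keep a log
5 0 * * * /opt/scripts/load_daily.sh >> /var/log/load_daily.log 2>&1
```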

PS: this solution applies only if your server runs continuously; a cron job assumes a machine that is up 24/7. Otherwise, you should use Anacron, which runs missed jobs once the machine is back up.

Build a Hive table on top of that directory in HDFS. Once new files are dumped into the table location, a select on that table will pick up the new files. I'd suggest changing the process that dumps the files so that it writes into date subfolders, and creating a table partitioned by date. After that, all you need is to run the recover-partitions command before selecting from the table.
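A sketch of that approach, assuming the dump process writes into `dt=YYYY-MM-DD` subfolders under `/data/incoming/`; the table and column names here are hypothetical:

```sql
-- External table over the dump directory, partitioned by date.
CREATE EXTERNAL TABLE IF NOT EXISTS daily_data (
  id INT,
  name STRING,
  amount DOUBLE
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/incoming/';

-- Recover partitions: register newly added date subfolders
-- in the metastore before querying.
MSCK REPAIR TABLE daily_data;
```

With this layout, no data is moved at all: each day's dump lands in a new partition directory, and `MSCK REPAIR TABLE` makes it visible to queries.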
