How to incremental import from MySQL to Hive using Sqoop?

I can successfully do an incremental import from MySQL to HDFS using Sqoop by running:

sqoop job --create JOBNAME -- import ... --incremental append --check-column id --last-value LAST
sqoop job --exec JOBNAME
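
For reference, a fuller job definition might look like the sketch below; the JDBC URL, credentials, table name, and target directory are placeholders, not details from the original question:

# A minimal sketch of a saved incremental-append job; the connection
# string, credentials, table, and target directory are hypothetical.
sqoop job --create JOBNAME \
  -- import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser \
  --password-file /user/me/.mysql-password \
  --table mytable \
  --target-dir /user/me/mytable \
  --incremental append \
  --check-column id \
  --last-value 0

# Each execution imports only rows whose id exceeds the stored last
# value, then saves the new incremental.last.value to the metastore.
sqoop job --exec JOBNAME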

That finishes with log messages like

INFO tool.ImportTool: Saving incremental import state to the metastore
INFO tool.ImportTool: Updated data for job: JOBNAME

And inspecting the job reveals that incremental.last.value was updated correctly.

If I attempt the same procedure, but add "--hive-import" to the definition of my job, it will execute successfully, but won't update incremental.last.value.

Is this a bug? Intended behavior? Does anyone have a procedure for incrementally importing data from MySQL and making it available via Hive?

I basically want my Hadoop cluster to be a read slave of my MySQL database, for fast analysis. If there's some other solution than Hive (Pig would be fine), I'd love to hear that too.

The --hive-import option is used to create the defined table structure on HDFS using MapReduce jobs. Moreover, data read into Hive follows schema-on-read, which means the data is not actually bound to the table until a query is executed. So every time you run a query, it is evaluated freshly against the schema on the Hive table, and the last incremental value does not get stored.

Every query on the Hive schema is treated as independent, since it runs at execution time and does not store old results.

You can also create the external Hive table manually, since that is only a one-time activity, and then continue importing the incremental data.
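
As an illustration of that one-time setup, an external table can be mapped onto the HDFS directory that the plain incremental import writes to; in this sketch the table name, columns, delimiter, and location are assumptions, not taken from the question:

# One-time activity (a sketch): map an external Hive table onto the
# directory used by the HDFS incremental import. The table name,
# columns, delimiter, and LOCATION are hypothetical; the comma
# delimiter matches Sqoop's default text output.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS mytable (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/me/mytable';"

After this, each sqoop job --exec JOBNAME simply appends new files under /user/me/mytable, and Hive queries pick them up automatically thanks to schema-on-read.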

We can get the last value using the following script:

--check-column id --incremental append --last-value "$($HIVE_HOME/bin/hive -e 'select max(id) from tablename')"

(Use --incremental lastmodified instead of append if the check column is a timestamp.)
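
Wired into a full command, that would look something like the sketch below; the JDBC URL, credentials, and directory names are hypothetical:

# Capture the maximum id already visible in Hive, then import only
# newer rows from MySQL. All names and the JDBC URL are hypothetical.
LAST=$("$HIVE_HOME"/bin/hive -e 'select max(id) from tablename')

sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser \
  --password-file /user/me/.mysql-password \
  --table tablename \
  --target-dir /user/me/tablename \
  --incremental append \
  --check-column id \
  --last-value "$LAST"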
