
Sqoop: Issue with Incremental import

I have a requirement where I need to incrementally import a table from MySQL into Hive, and I am running into issues doing that. This is what I have tried so far:

  1. I created a job to import the table with the command below.

sqoop job \
  --create test2 \
  -- import \
  --connect jdbc:mysql://URL \
  --username username \
  --password password \
  --table mysqlTablename \
  --hive-import \
  --hive-overwrite \
  --direct \
  --incremental lastmodified \
  --check-column last_modified_time \
  --last-value 0
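For reference, a saved job like this is typically executed and inspected with the standard sqoop job sub-commands; the only thing assumed here is the job name test2 from the definition above:

# Run the saved job; Sqoop records the new last-value after each run
sqoop job --exec test2

# Show the stored job definition, including the saved incremental state
sqoop job --show test2

# List all saved jobs
sqoop job --list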

First execution: imports everything as expected, with the lower boundary as '0' and the upper boundary as the current time. Second execution: all changes since the last run are picked up, but the old rows get overwritten, leaving only the rows that changed since the last run.

  2. I removed the --hive-overwrite and --hive-import options and used the --target-dir option instead (see the sketch below). First execution: fetches everything as expected, with the lower boundary as '0' and the upper boundary as the current time, but the data does not show up in Hive because the metastore is not updated. Second execution: it throws an error stating that the directory passed to --target-dir already exists. The job runs after removing that directory from HDFS, but that defeats the purpose.
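For clarity, a sketch of what this second variant presumably looked like; the job name test3 and the target path /user/data/mysqlTablename are placeholders of mine, not values from the original setup:

sqoop job \
  --create test3 \
  -- import \
  --connect jdbc:mysql://URL \
  --username username \
  --password password \
  --table mysqlTablename \
  --target-dir /user/data/mysqlTablename \
  --direct \
  --incremental lastmodified \
  --check-column last_modified_time \
  --last-value 0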

I have found this mentioned as a known problem, and the only solution I have found is to import the new values into a side directory and run sqoop merge on the data to flatten it. I would like to automate this in a shell script, and was wondering whether there is a better way to handle this incremental update.
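For what it is worth, here is a minimal sketch of that side-directory-plus-merge approach as a shell script. The directory paths, the merge key id, and the record jar/class produced by Sqoop's codegen step are assumptions for illustration, not details from the original setup:

#!/bin/bash
# Sketch: incremental import into a side directory, then sqoop merge onto the base copy.
BASE_DIR=/user/data/mysqlTablename/base      # existing full copy (placeholder path)
DELTA_DIR=/user/data/mysqlTablename/delta    # side directory for changed rows (placeholder path)
LAST_VALUE="$1"                              # last_modified_time recorded from the previous run

# 1. Pull only the rows modified since the last run into the side directory.
sqoop import \
  --connect jdbc:mysql://URL \
  --username username \
  --password password \
  --table mysqlTablename \
  --target-dir "$DELTA_DIR" \
  --incremental lastmodified \
  --check-column last_modified_time \
  --last-value "$LAST_VALUE"

# 2. Flatten the delta onto the base copy. --merge-key must be the table's primary key
#    (assumed to be id here); the jar and class come from the codegen step of the import.
sqoop merge \
  --new-data "$DELTA_DIR" \
  --onto "$BASE_DIR" \
  --target-dir /user/data/mysqlTablename/merged \
  --jar-file mysqlTablename.jar \
  --class-name mysqlTablename \
  --merge-key id

# 3. After verifying the merged output, it would replace the base copy, e.g.:
#    hadoop fs -rm -r "$BASE_DIR" && hadoop fs -mv /user/data/mysqlTablename/merged "$BASE_DIR"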

To restate the requirement: I want to know the best option for importing a MySQL table and applying its changes to Hive incrementally, based on the column that holds the last-modified timestamp, i.e. create, update or delete rows in the Hive table based on the changes in MySQL so that the two stay in sync.

Any help on this is greatly appreciated.

Regards, Rohit

It is very difficult for a Hive-based system to handle incremental loads that involve updates to existing records. This link describes a viable solution.

Use append mode instead of lastmodified to keep picking up the updates continuously.

E.g.:

sqoop job \
  --create test2 \
  -- import \
  --connect jdbc:mysql://URL \
  --username username \
  --password password \
  --table mysqlTablename \
  --hive-import \
  --direct \
  --incremental append \
  --check-column last_modified_time \
  --last-value "0"
