Sqoop incremental load lastmodified not working on updated records
I'm working on a Sqoop incremental job to load data from MySQL to HDFS. Below are the scenarios.
Scenario 1: Below are the records inserted into the sample table in MySQL.
select * from sample;
+-----+--------+--------+---------------------+
| id  | policy | salary | updated_time        |
+-----+--------+--------+---------------------+
| 100 |      1 |   4567 | 2017-08-02 01:58:28 |
| 200 |      2 |   3456 | 2017-08-02 01:58:29 |
| 300 |      3 |   2345 | 2017-08-02 01:58:29 |
+-----+--------+--------+---------------------+
Below is the table structure of the sample table in MySQL:
create table sample (id int not null primary key, policy int, salary int, updated_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP);
I'm trying to import this into HDFS by creating a Sqoop job as below:
sqoop job --create incjob -- import --connect jdbc:mysql://localhost/retail_db --username root -P --table sample --merge-key id --split-by id --target-dir /user/cloudera --append --incremental lastmodified --check-column updated_time -m 1
After executing the Sqoop job, below are the output records in HDFS:
$ hadoop fs -cat /user/cloudera/par*
100,1,4567,2017-08-02 01:58:28.0
200,2,3456,2017-08-02 01:58:29.0
300,3,2345,2017-08-02 01:58:29.0
Scenario 2: After inserting a few new records and updating some existing records in the sample table, below is the sample table.
select * from sample;
+-----+--------+--------+---------------------+
| id  | policy | salary | updated_time        |
+-----+--------+--------+---------------------+
| 100 |      6 |   5638 | 2017-08-02 02:01:09 |
| 200 |      2 |   7654 | 2017-08-02 02:01:10 |
| 300 |      3 |   2345 | 2017-08-02 01:58:29 |
| 400 |      4 |   1234 | 2017-08-02 02:01:17 |
| 500 |      5 |   6543 | 2017-08-02 02:01:18 |
+-----+--------+--------+---------------------+
After running the same Sqoop job, below are the records in HDFS:
hadoop fs -cat /user/cloudera/par*
100,1,4567,2017-08-02 01:58:28.0
200,2,3456,2017-08-02 01:58:29.0
300,3,2345,2017-08-02 01:58:29.0
100,6,5638,2017-08-02 02:01:09.0
200,2,7654,2017-08-02 02:01:10.0
400,4,1234,2017-08-02 02:01:17.0
500,5,6543,2017-08-02 02:01:18.0
Here the updated records in MySQL are inserted as new records in HDFS, instead of updating the existing records in HDFS. I have used both --merge-key and --append in my Sqoop job configuration. Could anyone help me resolve this issue?
You are using --merge-key, --append and lastmodified together. This is not correct.
--incremental append mode - Appends data to an existing dataset in HDFS. You should specify append mode when importing a table where new rows are continually being added with increasing row id values.
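As a minimal sketch, an append-only import of the same table would key off the increasing id column (the --last-value of 300 is illustrative; only rows with a larger id get pulled, which is exactly why append mode misses updates to existing rows):

# illustrative append-mode run: fetches only rows with id > 300
sqoop import --connect jdbc:mysql://localhost/retail_db --username root -P --table sample --target-dir /user/cloudera --incremental append --check-column id --last-value 300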
--incremental lastmodified mode - You should use this when rows of the source table may be updated, and each such update sets the value of a last-modified column to the current timestamp.
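A one-off lastmodified import (outside a saved job) might look like the sketch below; the timestamp is illustrative, and every row with a newer updated_time, whether newly inserted or updated, is fetched and merged into the target on id:

# illustrative one-off lastmodified run with an explicit checkpoint
sqoop import --connect jdbc:mysql://localhost/retail_db --username root -P --table sample --merge-key id --target-dir /user/cloudera --incremental lastmodified --check-column updated_time --last-value "2017-08-02 01:58:29" -m 1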
--merge-key - The merge tool runs a MapReduce job that takes two directories as input: a newer dataset and an older one. These are specified with --new-data and --onto respectively. The output of the MapReduce job will be placed in the HDFS directory specified by --target-dir.
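The merge tool can also be run standalone. A sketch, assuming hypothetical /user/cloudera/old and /user/cloudera/new import directories and a sample.jar / sample record class generated by sqoop codegen:

# illustrative standalone merge: keeps the newest row per id in /user/cloudera/merged
sqoop merge --new-data /user/cloudera/new --onto /user/cloudera/old --target-dir /user/cloudera/merged --jar-file sample.jar --class-name sample --merge-key id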
--last-value (value) - Specifies the maximum value of the check column from the previous import. If you run Sqoop from the command line without a saved Sqoop job, then you have to add the --last-value parameter yourself.
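A saved Sqoop job removes that burden: it records the checkpoint in its metastore and reuses it on every execution. You can inspect the stored value like this (incjob is the job created above):

# shows the saved job's parameters, including incremental.last.value
sqoop job --show incjob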
In your case there are some new records and some existing records are also updated, so you need to go with lastmodified mode.
Your Sqoop command will be:
sqoop job --create incjob -- import --connect jdbc:mysql://localhost/retail_db --username root -P --table sample --merge-key id --target-dir /user/cloudera --incremental lastmodified --check-column updated_time -m 1
Since you have specified only one mapper, there is no need for --split-by.
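As a quick usage check, re-running the job and cat-ing the target directory should now show a single row per id, carrying the latest updated_time from MySQL:

# re-run the incremental job, then inspect the merged output
sqoop job --exec incjob
hadoop fs -cat /user/cloudera/par*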