
Sqoop incremental load lastmodified not working on updated records

I'm working on a Sqoop incremental job to load data from MySQL to HDFS. Below are the scenarios.

Scenario 1: Below are the records inserted into the sample table in MySQL.

select * from sample;
+-----+--------+--------+---------------------+
| id  | policy | salary | updated_time        |
+-----+--------+--------+---------------------+
| 100 |      1 |   4567 | 2017-08-02 01:58:28 |
| 200 |      2 |   3456 | 2017-08-02 01:58:29 |
| 300 |      3 |   2345 | 2017-08-02 01:58:29 |
+-----+--------+--------+---------------------+

Below is the table structure of the sample table in MySQL:

create table sample (id int not null primary key, policy int, salary int, updated_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP);
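
For context, the ON UPDATE CURRENT_TIMESTAMP clause is what lets updated_time serve as a lastmodified check column. A minimal illustrative sketch (the actual DML statements are not shown in the question) of how rows like those above behave:

INSERT INTO sample (id, policy, salary) VALUES (100, 1, 4567);
INSERT INTO sample (id, policy, salary) VALUES (200, 2, 3456);
INSERT INTO sample (id, policy, salary) VALUES (300, 3, 2345);
-- updated_time is filled by the DEFAULT clause on insert,
-- and refreshed automatically whenever a row is modified, e.g.:
UPDATE sample SET policy = 6, salary = 5638 WHERE id = 100;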

I'm trying to import this into HDFS by creating a Sqoop job as below:

sqoop job --create incjob -- import --connect jdbc:mysql://localhost/retail_db --username root -P --table sample --merge-key id --split-by id --target-dir /user/cloudera --append --incremental lastmodified --check-column updated_time -m 1
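
Creating the job only saves its definition; it still has to be executed. A quick sketch of the standard sqoop job subcommands used to run and inspect it (the job name incjob comes from the command above):

# Execute the saved incremental job (-P in the job definition prompts for the MySQL password)
sqoop job --exec incjob

# Inspect saved jobs and their stored metadata, including the remembered last-value
sqoop job --list
sqoop job --show incjob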

After executing the Sqoop job, below are the output records in HDFS:

$ hadoop fs -cat /user/cloudera/par*
100,1,4567,2017-08-02 01:58:28.0
200,2,3456,2017-08-02 01:58:29.0
300,3,2345,2017-08-02 01:58:29.0

Scenario 2: After inserting a few new records and updating existing records in the sample table, below is the sample table:

select * from sample;
+-----+--------+--------+---------------------+
| id  | policy | salary | updated_time        |
+-----+--------+--------+---------------------+
| 100 |      6 |   5638 | 2017-08-02 02:01:09 |
| 200 |      2 |   7654 | 2017-08-02 02:01:10 |
| 300 |      3 |   2345 | 2017-08-02 01:58:29 |
| 400 |      4 |   1234 | 2017-08-02 02:01:17 |
| 500 |      5 |   6543 | 2017-08-02 02:01:18 |
+-----+--------+--------+---------------------+

After running the same Sqoop job again, below are the records in HDFS:

hadoop fs -cat /user/cloudera/par*
100,1,4567,2017-08-02 01:58:28.0
200,2,3456,2017-08-02 01:58:29.0
300,3,2345,2017-08-02 01:58:29.0
100,6,5638,2017-08-02 02:01:09.0
200,2,7654,2017-08-02 02:01:10.0
400,4,1234,2017-08-02 02:01:17.0
500,5,6543,2017-08-02 02:01:18.0

Here the updated records in MySQL are inserted as new records in HDFS instead of updating the existing records in HDFS. I have used both --merge-key and --append in my Sqoop job configuration. Could anyone help me resolve this issue?

You are using --merge-key, --append, and lastmodified together. This is not correct.

  • --incremental append mode - Append data to an existing dataset in HDFS. You should specify append mode when importing a table where new rows are continually being added with increasing row id values.

  • --incremental lastmodified mode - You should use this when rows of the source table may be updated, and each such update sets the value of a last-modified column to the current timestamp.

  • --merge-key - The merge tool runs a MapReduce job that takes two directories as input: a newer dataset and an older one. These are specified with --new-data and --onto respectively. The output of the MapReduce job will be placed in the HDFS directory specified by --target-dir.

  • --last-value (value) - Specifies the maximum value of the check column from the previous import. If you run sqoop from the command line without a saved Sqoop job, you have to supply the --last-value parameter yourself; see the sketch after this list.
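
For completeness, here is a rough sketch of what the equivalent one-off command-line import would look like without a saved job; the --last-value shown is the maximum updated_time from the first import above and is only illustrative:

sqoop import --connect jdbc:mysql://localhost/retail_db --username root -P \
  --table sample --merge-key id --target-dir /user/cloudera \
  --incremental lastmodified --check-column updated_time \
  --last-value "2017-08-02 01:58:29" -m 1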

In your case there are some new records and some existing records are also updated, so you need to go with lastmodified mode.

Your Sqoop command will be:

sqoop job --create incjob -- import --connect jdbc:mysql://localhost/retail_db --username root -P --table sample --merge-key id --target-dir /user/cloudera --incremental lastmodified --check-column updated_time -m 1

Since you have specified only one mapper, there is no need for --split-by.
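
With lastmodified plus --merge-key (and no --append), a second run merges on id, so the target directory should end up with one row per key, something like:

hadoop fs -cat /user/cloudera/part*
100,6,5638,2017-08-02 02:01:09.0
200,2,7654,2017-08-02 02:01:10.0
300,3,2345,2017-08-02 01:58:29.0
400,4,1234,2017-08-02 02:01:17.0
500,5,6543,2017-08-02 02:01:18.0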

  1. I understand that you are trying to update the existing records in HDFS whenever a change happens in the source MySQL table.
  2. You should use --append only when you don't need updates to existing source records reflected in HDFS.
  3. Another approach is to migrate the changed records into a separate directory as delta_records and then join them with the base_records; a rough SQL sketch of that reconcile step follows this list. Please see the Hortonworks link for more clarity.
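
For item 3, a minimal SQL sketch of the reconcile step, assuming the question's schema and hypothetical Hive tables named base_records and delta_records (this is not the exact Hortonworks example):

-- Keep only the newest version of each id across the base and delta datasets
CREATE VIEW reconcile_view AS
SELECT id, policy, salary, updated_time
FROM (
    SELECT id, policy, salary, updated_time,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_time DESC) AS rn
    FROM (
        SELECT id, policy, salary, updated_time FROM base_records
        UNION ALL
        SELECT id, policy, salary, updated_time FROM delta_records
    ) all_records
) ranked
WHERE rn = 1;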
