简体   繁体   中英

Incremental data load using sqoop without primary key or timestamp

I have a table that doesn't have any primary key and datemodified/timestamp. This table is just like a transaction table that keeps saving all data (No delete/update).

My problem now is I want to inject the data to HDFS without loading the whole table again every time I run the incremental load.

The code below gets the latest row imported to HDFS if my table has primary key.

sqoop job \
--create tb_w_PK_DT_append \
-- \
import \
--connect jdbc:mysql://10.217.55.176:3306/SQOOP_Test \
--username root \
--incremental append \
--check-column P_id \
--last-value 0 \
--target-dir /data \
--query "SELECT * FROM tb_w_PK_DT WHERE \$CONDITIONS" \
-m 1;

Any solution to get the latest data imported without any primary key or date modified.

You can follow these steps

1) The initial load data (previous day data) is in hdfs  - Relation A
2) Import the current data into HDFS using sqoop -- Relation B
3) Use pig Load the above two hdfs directories in relation A and B define schema.
4) Convert them to tuples and join them by all columns
5) The join result will have two tuples in each row((A,B),(A,B)) , fetch the result from join where tuple B is null ((A,D),).
6) Now flatten the join by tuple A you will have new/updated records(A,D).

I know I am bit late to answer this, but just wanted to share for reference. If There's a scenario that you don't have primary key column or date column on your source table and you want to sqoop the increment data only to hdfs.

Let's say there's some table which holds history of data and new rows being inserted to on daily basis and you just need the newly inserted rows to hdfs. if your source is sql server you can create Insert or Update trigger on your history table.

TransactionHistoryTable

you can create a Insert trigger as shown below:

CREATE TRIGGER transactionInsertTrigger 
ON  [dbo].[TransactionHistoryTable]
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
INSERT INTO [dbo].[TriggerHistoryTable]
(
 product ,price,payment_type,name,city,state,country,Last_Modified_Date
 )
SELECT
 product,price,payment_type,name,city,state,country,GETDATE() as Last_Modified_Date
FROM
inserted i
END

Create a Table to hold the records when an insert events occurs on your main table. Keep the schema same as your main table, however you can add extra columns to this. the above trigger will insert a row into table whenever there's any new row gets inserted to your main TransactionHistoryTable.

CREATE TABLE [dbo].[TriggerHistoryTable](
    [product] [varchar](20) NULL,
    [price] [int] NULL,
    [payment_type] [varchar](20) NULL,
    [name] [varchar](20) NULL,
    [city] [varchar](20) NULL,
    [state] [varchar](20) NULL,
    [country] [varchar](20) NULL,
    [Last_Modified_Date] [date] NULL
) ON [PRIMARY]

Now if we insert two new rows to main TransactionHistoryTable, because of this insert evert, our triggered was fired and has inserted these two rows to TriggerHistoryTable also along with main TransactionHistoryTable

insert into [Transaction_db].[dbo].[TransactionHistoryTable]
values
('Product3',2100,'Visa','Cindy' ,'Kemble','England','United Kingdom')
,('Product4',50000,'Mastercard','Tamar','Headley','England','United Kingdom')
;

select * from TriggerHistoryTable;

TriggerHistoryTable

Now you can sqoop from your TriggerHistoryTable, which will be having daily insert or updated records. You can use Incremental sqoop also since we have added a date column to this. once you have imported data to hdfs you can clear this table on daily basis or weekly. This is just an example with sql server. you can have triggers with Teradata and oracle and other databases also. you can also set up a update/delete trigger also.

If your data has a field like rowid you can import using --last-value in sqoop arguments .

Please refer to https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM