
Incremental data load using sqoop without primary key or timestamp

I have a table that doesn't have a primary key or a datemodified/timestamp column. It is like a transaction table that keeps accumulating all data (no deletes or updates).

My problem is that I want to ingest the data into HDFS without reloading the whole table every time I run the incremental load.

The code below imports only the latest rows into HDFS when the table has a primary key.

sqoop job \
--create tb_w_PK_DT_append \
-- \
import \
--connect jdbc:mysql://10.217.55.176:3306/SQOOP_Test \
--username root \
--incremental append \
--check-column P_id \
--last-value 0 \
--target-dir /data \
--query "SELECT * FROM tb_w_PK_DT WHERE \$CONDITIONS" \
-m 1;

Is there any way to import only the newly added data when the table has neither a primary key nor a date-modified column?

You can follow these steps:

1) The initial load (the previous day's data) is already in HDFS -- relation A.
2) Import the current data into HDFS using sqoop -- relation B.
3) Use Pig to load the two HDFS directories into relations A and B, defining the schema for each.
4) Convert the rows to tuples and join the two relations on all columns, using an outer join so unmatched rows are kept.
5) In the join result, rows whose A-side tuple is null exist only in the current import B.
6) Flatten those rows to obtain the new/updated records.
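The diff in steps 4-6 boils down to a set difference on whole rows. This is a minimal Python sketch of that logic (with made-up sample rows), not the actual Pig script:

```python
# Sketch of steps 4-6: find rows present in the current import
# (relation B) but absent from the previous load (relation A).
# Rows are compared on all columns, so each row is a tuple.

previous_load = {                       # relation A: yesterday's data
    ("Product1", 1200, "Visa"),
    ("Product2", 900, "Mastercard"),
}
current_import = {                      # relation B: today's full import
    ("Product1", 1200, "Visa"),
    ("Product2", 900, "Mastercard"),
    ("Product3", 2100, "Visa"),         # newly inserted row
}

# In an insert-only table, the new records are exactly B minus A.
new_records = current_import - previous_load
print(sorted(new_records))
```

Because the table is insert-only, comparing on all columns is safe; if the table allowed updates, the same row key could appear with different values and a plain set difference would report updates as new rows.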

I know I am a bit late to answer this, but I wanted to share it for reference. This covers the scenario where the source table has no primary key column or date column and you want to sqoop only the incremental data to HDFS.

Say there is a table that holds a history of data, with new rows inserted daily, and you only need the newly inserted rows in HDFS. If your source is SQL Server, you can create an INSERT or UPDATE trigger on the history table.

(screenshot: sample rows in TransactionHistoryTable)

You can create an INSERT trigger as shown below:

CREATE TRIGGER transactionInsertTrigger
ON [dbo].[TransactionHistoryTable]
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO [dbo].[TriggerHistoryTable]
    (
        product, price, payment_type, name, city, state, country, Last_Modified_Date
    )
    SELECT
        product, price, payment_type, name, city, state, country, GETDATE() AS Last_Modified_Date
    FROM inserted i
END

Create a table to hold the records captured when an insert event occurs on your main table. Keep its schema the same as the main table, though you can add extra columns (here, Last_Modified_Date). The trigger above inserts a row into this table whenever a new row is inserted into the main TransactionHistoryTable.

CREATE TABLE [dbo].[TriggerHistoryTable](
    [product] [varchar](20) NULL,
    [price] [int] NULL,
    [payment_type] [varchar](20) NULL,
    [name] [varchar](20) NULL,
    [city] [varchar](20) NULL,
    [state] [varchar](20) NULL,
    [country] [varchar](20) NULL,
    [Last_Modified_Date] [date] NULL
) ON [PRIMARY]

Now, if we insert two new rows into the main TransactionHistoryTable, the trigger fires on the insert event and writes the same two rows to TriggerHistoryTable as well:

insert into [Transaction_db].[dbo].[TransactionHistoryTable]
values
('Product3',2100,'Visa','Cindy' ,'Kemble','England','United Kingdom')
,('Product4',50000,'Mastercard','Tamar','Headley','England','United Kingdom')
;

select * from TriggerHistoryTable;

(screenshot: resulting rows in TriggerHistoryTable)

Now you can sqoop from TriggerHistoryTable, which holds only the daily inserted or updated records. You can also use incremental sqoop, since we have added a date column. Once the data has been imported to HDFS, you can clear this table daily or weekly. This is just an example with SQL Server; Teradata, Oracle, and other databases support triggers too, and you can also set up UPDATE/DELETE triggers.
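With the Last_Modified_Date column in place, an incremental import from the trigger table could look roughly like this. This is a sketch only: the host, database name, credentials, last-value, and target directory are placeholders you would adjust to your environment:

```shell
sqoop import \
  --connect "jdbc:sqlserver://<host>:1433;databaseName=Transaction_db" \
  --username <user> -P \
  --table TriggerHistoryTable \
  --incremental lastmodified \
  --check-column Last_Modified_Date \
  --last-value "2020-01-01 00:00:00" \
  --target-dir /data/trigger_history \
  -m 1
```

Saving this as a sqoop job (as in the first answer) would let sqoop track `--last-value` automatically between runs.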

If your data has a field like ROWID, you can do the import using `--last-value` in the sqoop arguments.

Please refer to https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
