Sqoop incremental load using Informatica BDM

I am new to Informatica BDM. I have a use case in which I have to import data incrementally (100 tables) from an RDBMS into Hive on a daily basis. Can someone please guide me on the best possible approach to achieve this?

Thanks, Sumit

Hadoop follows a write once, read many (WORM) approach, so incremental loads are not straightforward. Here are some guidelines you can follow to validate your current requirement:

  1. If the table is small or mid-sized and does not have too many records, it is better to refresh the entire table (see the full-refresh sketch at the end of this answer).
  2. If the table is big and the incremental load involves add/update/delete operations, you can stage the delta and perform a join operation to re-create the data set (see the sketch after this list).
  3. For a large table with a large delta, you can create a version number for all the latest records, land each delta in a new directory, and create a view that returns only the latest version for further processing (also shown in the sketch after this list). This avoids a heavy merge operation.
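
For options 2 and 3, a rough sketch of what the underlying load could look like is below. Whether you drive this from an Informatica BDM mapping or run Sqoop directly, the pattern is the same: pull only the changed rows into a dated staging directory, then expose the latest version of each row through a view. The connection string, the ORDERS table, and the ID / LAST_MODIFIED_TS columns are placeholders of my own, not anything from the original question, and exact Sqoop flags can vary by version.

    # Minimal sketch (placeholder connection details, table and column names).
    LOAD_DT=$(date +%Y%m%d)

    # 1) Stage only the rows changed since the last run into a new dated directory.
    #    Sqoop prints the new --last-value at the end of the run; persist it for
    #    the next run (or let a Sqoop saved job track it for you).
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username etl_user \
      --password-file /user/etl/.db_password \
      --table ORDERS \
      --incremental lastmodified \
      --check-column LAST_MODIFIED_TS \
      --last-value "2020-01-01 00:00:00" \
      --append \
      --target-dir /data/staging/orders/delta_${LOAD_DT} \
      --num-mappers 4

    # 2) Register the new delta as a partition of an (assumed) external,
    #    partitioned staging table, then rebuild a view that keeps only the
    #    newest version of each key, so downstream jobs never see stale rows.
    hive -e "
      ALTER TABLE staging.orders_delta ADD IF NOT EXISTS
        PARTITION (load_dt='${LOAD_DT}')
        LOCATION '/data/staging/orders/delta_${LOAD_DT}';

      DROP VIEW IF EXISTS curated.orders_latest;
      CREATE VIEW curated.orders_latest AS
      SELECT * FROM (
        SELECT d.*,
               ROW_NUMBER() OVER (PARTITION BY id
                                  ORDER BY last_modified_ts DESC) AS rn
        FROM staging.orders_delta d
      ) v
      WHERE rn = 1;
    "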

If deletes do not come through as change records, you also need to think about how to act on them; in such a case you need to do a full refresh.
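
For option 1, and for tables where deletes are not delivered as changes, the simplest pattern is a full refresh that overwrites the Hive table on every run. A minimal sketch, again with placeholder connection details and a hypothetical CUSTOMERS table:

    # Re-import the whole table and overwrite the existing Hive data on each run.
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username etl_user \
      --password-file /user/etl/.db_password \
      --table CUSTOMERS \
      --hive-import \
      --hive-table customers \
      --hive-overwrite \
      --num-mappers 4

With around 100 tables, it is usually worth parameterizing one such mapping or command and driving it from a table list, rather than building a separate job per table.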
