简体   繁体   中英

What is the best way to ingest data from Terdata into Hadoop with Informatica?

What is the best ways to parallel ingest data from Teradata database into Hadoop with parallel data moving?

  • If we create a job which is simple opens one session to Teradata database it will take a lot of time to load huge table.
  • if we create a set of sessions to load data in parallel, and also make Select in each of the sessions, than it will make a set of Full table scans Teradata to produce a data

What is the recommended best practice to load data in parallelised streams and make unnecessary workload to Teradata?

If Tera data supports table partitioning like oracle, you could try reading the table based on partitioning points which will enable parallelism in read...

Other option you have is, split the table into multiple partitions like adding a where clause on indexed column. This will ensure index scan and you can avoid full table scan.

If you use partition names in the select clause, Power Center will select only the rows within that partition so there won't be duplicate read (don't forget to choose Database partitioning in Informatica session level). However if you use key range partition you have to choose the range as you mentioned in settings. Usually we use NTILE oracle analytical function to split the table into multiple portions so that the read will be unique across the selects. Please let me know if you have any question. If you have range/auto generated/surrogate key column in the table use it in where clause - write a sub-query to divide the table into multiple portions.

The most scalable way to ingest data into Hadoop form teradata, which i found is to use Teradata connector for hadoop. It is included in Cloudera & Hortonworks distributions. I will show example base on Cloudera documentation, but the same works with Hortonworks as well:

Informatica big Data edition is using standard Scoop invocation via command line and submitting set of parameters to it. So the main question is - which driver to use to make parallel connections between two MPP systems.

Here is the link to the Cloudera documentation: Using the Cloudera Connector Powered by Teradata

And here is the digest from this documentation (You could find that this connector support different kinds of load balancing between connections):

Cloudera Connector Powered by Teradata supports the following methods for importing data from Teradata to Hadoop:

  • split.by.amp
    • split.by.value
    • split.by.partition
    • split.by.hash
    • split.by.amp Method

This optimal method retrieves data from Teradata. The connector creates one mapper per available Teradata AMP, and each mapper subsequently retrieves data from each AMP. As a result, no staging table is required. This method requires Teradata 14.10 or higher.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM