
What is the best way to ingest data from Teradata into Hadoop with Informatica?

Original 2017-07-04 16:25:27 8 3 hadoop/ teradata/ informatica/ informatica-powercenter/ bigdata

What is the best way to ingest data from a Teradata database into Hadoop in parallel?

If we create a job that simply opens one session to the Teradata database, loading a huge table will take a long time.
If we instead create a set of sessions to load data in parallel, each running its own SELECT, then Teradata will perform a full table scan for every session to produce the data.

What is the recommended best practice for loading data in parallel streams without putting unnecessary workload on Teradata?

3 answers

If Teradata supports table partitioning like Oracle, you could try reading the table based on partitioning points, which will enable parallelism in the read...

The other option is to split the table into multiple portions by adding a WHERE clause on an indexed column. This ensures an index scan, so you can avoid a full table scan.

If you use partition names in the select clause, PowerCenter will select only the rows within that partition, so there won't be duplicate reads (don't forget to choose Database partitioning at the Informatica session level). However, if you use key range partitioning, you have to choose the ranges in the settings, as you mentioned. Usually we use the Oracle NTILE analytic function to split the table into multiple portions so that the reads are unique across the selects. Please let me know if you have any questions. If you have a range, auto-generated, or surrogate key column in the table, use it in the WHERE clause: write a sub-query to divide the table into multiple portions.
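As a rough illustration of the key range idea, here is a minimal Python sketch that produces one non-overlapping predicate per session. The function name `key_range_predicates`, the column `order_id`, and the key bounds are all hypothetical. Note that this splits the key interval evenly, which is simpler than NTILE (NTILE balances row counts), so it only balances well when the keys are spread fairly uniformly.

```python
def key_range_predicates(column, min_key, max_key, sessions):
    """Return one WHERE predicate per session.

    The predicates cover [min_key, max_key] exactly once, so no row is
    read by two sessions. Assumes an integer key column (hypothetical).
    """
    if sessions < 1:
        raise ValueError("sessions must be at least 1")
    if max_key < min_key:
        raise ValueError("max_key must not be smaller than min_key")

    total = max_key - min_key + 1
    # Never create more ranges than there are key values.
    sessions = min(sessions, total)
    base, extra = divmod(total, sessions)

    predicates = []
    low = min_key
    for i in range(sessions):
        # The first `extra` ranges take one additional key value.
        size = base + (1 if i < extra else 0)
        high = low + size - 1
        predicates.append(f"{column} BETWEEN {low} AND {high}")
        low = high + 1
    return predicates


if __name__ == "__main__":
    for predicate in key_range_predicates("order_id", 1, 10, 4):
        print(predicate)
```

Each predicate would go into the source filter (or SQL override) of one session, so every session reads a distinct slice through the index rather than scanning the whole table.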

The most scalable way I have found to ingest data into Hadoop from Teradata is to use the Teradata Connector for Hadoop. It is included in the Cloudera and Hortonworks distributions. I will show an example based on the Cloudera documentation, but the same works with Hortonworks as well:

Informatica Big Data Edition uses a standard Sqoop invocation via the command line and submits a set of parameters to it. So the main question is which driver to use to make parallel connections between the two MPP systems.

Here is the link to the Cloudera documentation: Using the Cloudera Connector Powered by Teradata

And here is a digest of that documentation (you will see that the connector supports different kinds of load balancing between connections):

Cloudera Connector Powered by Teradata supports the following methods for importing data from Teradata to Hadoop:

split.by.amp

split.by.value

split.by.partition

split.by.hash

split.by.amp Method

This optimal method retrieves data from Teradata. The connector creates one mapper per available Teradata AMP, and each mapper subsequently retrieves data from each AMP. As a result, no staging table is required. This method requires Teradata 14.10 or higher.
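To make the invocation more concrete, here is a hedged Python sketch that assembles a Sqoop command line selecting one of the four methods listed above. It assumes the connector accepts the method through an `--input-method` extra argument placed after `--`, which is how I recall the Cloudera documentation describing it; please verify against your connector version. The function name `build_sqoop_import`, the JDBC URL, the user, the table, and the target directory are placeholders, and authentication options are deliberately left out.

```python
import shlex

# The four import methods named in the connector documentation digest.
INPUT_METHODS = {
    "split.by.amp",
    "split.by.value",
    "split.by.partition",
    "split.by.hash",
}


def build_sqoop_import(jdbc_url, username, table, target_dir,
                       mappers, input_method="split.by.amp"):
    """Build the argument list for a Sqoop import from Teradata.

    Assumption: the connector reads `--input-method` from the extra
    arguments after `--`. Password handling is omitted on purpose.
    """
    if input_method not in INPUT_METHODS:
        raise ValueError(f"unknown input method: {input_method}")
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
        "--",  # everything after this goes to the connector
        "--input-method", input_method,
    ]


if __name__ == "__main__":
    cmd = build_sqoop_import(
        "jdbc:teradata://td-host/DATABASE=sales",  # placeholder URL
        "etl_user",                                # placeholder user
        "BIG_TABLE",                               # placeholder table
        "/user/etl/BIG_TABLE",                     # placeholder HDFS path
        mappers=8,
    )
    # Print a shell-safe version for inspection instead of running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Since the text above says split.by.amp creates one mapper per available AMP, the `--num-mappers` value probably matters mainly for the other three methods.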

Ingest Mainframe IMS data into Hadoop

hadoop - what is the best way to fetch data from a very big sequence file?

Unable to ingest data from flume to hdfs hadoop for logs

What is best way to see data format in hadoop hdfs?

What is the best way to test hadoop?

What is the best way to add BigDecimals in Hadoop?

Ingest log files from edge nodes to Hadoop

What is the best way to learn about the Hadoop ecosystem

What is the best way to run Lucene/Solr on Hadoop?

What is an efficient way to send data from MongoDB to Hadoop?


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1
1 2017-07-04 21:00:09

solution2
0 2017-07-05 12:06:19

solution3
0 ACCEPTED 2017-08-08 07:17:11
