
Talend performance

We have a requirement where we read data from three different files and join these files on different columns within the same job.

Each file is around 25-30 GB, but our system has only 16 GB of RAM. We do the joins with tMap, and Talend keeps all the lookup (reference) data in physical memory. In my case I cannot provide that much memory, so the job fails with an out-of-memory error. If I use the join with the temp disk option in tMap, the job is dead slow.

Please help me with these questions.

  1. How does Talend process data larger than the RAM size?
  2. Is pipeline parallelism in place with Talend? Am I missing anything in the code to accomplish that?
  3. tUniqRow and join operations were done in physical memory, causing the job to run dead slow. A disk option is available to handle this functionality, but it was too slow.
  4. How can performance be improved without pushing the data to a database (ELT)? Can Talend handle huge data volumes in the millions of rows, and can this kind of data be handled with a smaller amount of RAM?

Thanks

Talend processes large amounts of data very fast and in an efficient manner. It all depends on your knowledge of the Talend platforms.

Please consider the comments below as answers to your questions.

Q1. How does Talend process data larger than the RAM size?

A. You cannot use your entire RAM for Talend Studio. Only a fraction of the RAM can be used, roughly half of it.

For example, with 8 GB of memory available on a 64-bit system, the optimal settings can be:

-vmargs
-Xms1024m
-Xmx4096m
-XX:MaxPermSize=512m
-Dfile.encoding=UTF-8
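These flags are typically placed in the Talend Studio .ini file that sits next to the Studio executable (the exact file name depends on your platform and edition), or, for one job only, in the Run view under Advanced settings > Use specific JVM arguments. On a 16 GB machine a larger heap is possible; the values below are only an assumed starting point to tune for your environment:

-vmargs
-Xms2048m
-Xmx8192m
-XX:MaxPermSize=512m
-Dfile.encoding=UTF-8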

Now, in your case, you either have to increase your RAM to something on the order of 100 GB

OR simply write the data to the hard disk. For this you have to choose a temp data directory for buffer components such as tMap, tBufferInputs, tAggregatedRow, etc.

Q2. Is pipeline parallelism in place with Talend? Am I missing anything in the code to accomplish that?

A. In Talend Studio, parallelization of data flows means partitioning the input data flow of a Subjob into parallel processes and executing them simultaneously, so as to gain better performance.

But this feature is available only on the condition that you have subscribed to one of the Talend Platform solutions.

When you have to develop a Job to process very large data volumes using Talend Studio, you can enable or disable the parallelization with a single click, and the Studio then automates the implementation across the given Job.


Parallel Execution: the implementation of the parallelization requires four key steps, as explained below:

Partitioning: In this step, the Studio splits the input records into a given number of threads.

Collecting: In this step, the Studio collects the split threads and sends them to a given component for processing.

Departitioning: In this step, the Studio groups the outputs of the parallel executions of the split threads.

Recollecting: In this step, the Studio captures the grouped execution results and outputs them to a given component.
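To make the four steps concrete, here is a small, self-contained Java sketch of the same idea (split, process in parallel, regroup). It is only a conceptual illustration, not the code Talend generates, and the row transformation (upper-casing strings) is an arbitrary stand-in:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Conceptual sketch of partition -> collect -> departition -> recollect.
public class ParallelSketch {

    public static void main(String[] args) throws Exception {
        // Fake input rows standing in for records read from a file.
        List<String> inputRows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            inputRows.add("row-" + i);
        }

        int threads = 4;
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        // 1. Partitioning: split the input records into a given number of chunks.
        int chunkSize = (inputRows.size() + threads - 1) / threads;
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < inputRows.size(); i += chunkSize) {
            partitions.add(inputRows.subList(i, Math.min(i + chunkSize, inputRows.size())));
        }

        // 2. Collecting: hand each chunk to a worker thread for processing.
        List<Future<List<String>>> futures = new ArrayList<>();
        for (List<String> partition : partitions) {
            Callable<List<String>> task = () -> {
                List<String> out = new ArrayList<>();
                for (String row : partition) {
                    out.add(row.toUpperCase()); // stand-in for the real transformation
                }
                return out;
            };
            futures.add(pool.submit(task));
        }

        // 3. Departitioning: group the outputs of the parallel executions.
        List<String> regrouped = new ArrayList<>();
        for (Future<List<String>> future : futures) {
            regrouped.addAll(future.get());
        }
        pool.shutdown();

        // 4. Recollecting: pass the grouped result on to the next step
        //    (here we just print a summary).
        System.out.println("Processed " + regrouped.size() + " rows across "
                + partitions.size() + " partitions");
    }
}

In the Studio this whole cycle is configured graphically; the sketch only shows why partitioning the flow lets several cores work on it at the same time.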

Q3. tUniqRow and join operations were done in physical memory, causing the job to run dead slow. A disk option is available to handle this functionality, but it was too slow.

Q4. How can performance be improved without pushing the data to a database (ELT)? Can Talend handle huge data volumes in the millions of rows, and can this kind of data be handled with a smaller amount of RAM?

A3 & A4. Here I suggest you insert the data directly into the database using tOutputBulkExec components, and then apply these operations at the database level using ELT components.

You can also try out some changes in the job definition itself, such as:

-- Use streaming.
-- Use trimming for big string data, so unnecessary data is not transferred.
-- Use OnSubjobOk as the connector instead of OnComponentOk, so the garbage collector has a chance to free memory in time.
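As a small illustration of the trimming tip, here is a minimal, self-contained Java sketch of a null-safe trim; in a real job the same expression would normally sit in a tMap output expression or a tJavaRow body, and the field name used here is invented:

// Sketch of the "trim big string data" tip. The field "comments" is a
// made-up example of a large free-text column.
public class TrimSketch {

    // Null-safe trim so rows with missing values do not throw a NullPointerException.
    static String trimmed(String value) {
        return value == null ? null : value.trim();
    }

    public static void main(String[] args) {
        String comments = "   padded free-text field   ";
        System.out.println("[" + trimmed(comments) + "]"); // prints [padded free-text field]
    }
}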
