Talend parallelization and Java scope

Question

I am creating a job to build a complex multi-level document for MongoDB from relational data.

I read 'product' records in from Oracle.

I have a tJavaRow in which I use the MongoDB API to create a product document (BasicDBObject) from the incoming product details. I store this document in the globalMap (call this 'product_doc'), as I need to embed a sub-document in it later in the subjob.

I use a tFlowToIterate to store the product_id in the globalMap.

I then have another Oracle input which uses the product_id from the globalMap as a parameter in the SQL, so it fetches the 'many' side of the relationship to products (call this 'product_orders').

I build a Java List of 'product_order' documents and write the List to the globalMap; let's call this 'product_orders'.

I then insert the 'product_orders' List as a sub-document into the 'product' document in a tJava component. I write 'product' to MongoDB and then move on to the next product row from Oracle.

It is more complex than this, creating a 5-level hierarchy, but this is the basic idea. However, it takes 3 hours to run.

So I want to set the job to run parallelized, so that each product row from Oracle gets dispatched onto a new thread, round-robin style.

However, I have a heavy dependency on the globalMap to store objects for later use in the flow, and I know the threads will trample all over each other. I assume each thread maintains the same variable scope across the subjob...

I think I can identify the thread ID using a global variable in the globalMap, "tCollector_1_THREAD_ID".

So I had considered doing this when I add documents/objects to the globalMap:

globalMap.put("product_doc_" + globalMap.get("tCollector_1_THREAD_ID"), productDoc);

That way everything I put in the globalMap is thread-specific and tagged. But I don't know how tCollector_1_THREAD_ID gets populated; if it is in the globalMap, then surely each thread can trample over this value too?

It didn't work; I was getting a load of null errors.

So I guess my question is about variable scope and the use of the globalMap with tJavaRow components in a parallelized data flow, when you need to maintain references in each thread.
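To make the key-tagging idea concrete, here is a minimal plain-Java sketch of what I mean. It is not Talend-generated code: `sharedMap` is an ordinary ConcurrentHashMap standing in for the globalMap, and the thread tag comes from `Thread.currentThread().getName()` rather than from tCollector_1_THREAD_ID. Both are assumptions made only for illustration, as are the helper names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ThreadTaggedKeys {
    // Stand-in for Talend's globalMap (assumption: a shared String -> Object map).
    static final Map<String, Object> sharedMap = new ConcurrentHashMap<>();

    // Build a key that is unique per thread tag, e.g. "product_doc_Thread-0".
    static String keyFor(String base, String threadTag) {
        return base + "_" + threadTag;
    }

    // Store a value under the key that belongs to the given thread tag.
    static void putForThread(String base, String threadTag, Object value) {
        sharedMap.put(keyFor(base, threadTag), value);
    }

    // Read back the value that belongs to the given thread tag (null if absent).
    static Object getForThread(String base, String threadTag) {
        return sharedMap.get(keyFor(base, threadTag));
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            // Hypothetical thread tag; in the job this would be the collector's thread ID.
            String tag = Thread.currentThread().getName();
            putForThread("product_doc", tag, "doc built by " + tag);
            System.out.println(getForThread("product_doc", tag));
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

Each thread only ever touches its own keys here, so the values cannot collide. Whether the thread ID itself can be read safely from a shared map is exactly what I am unsure about.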

---- UPDATE ------

For clarity: the page linked below states that you can get the thread ID from the variable tCollector_1_THREAD_ID. But it gets that variable from the globalMap.

Surely the globalMap is a global variable, so how can multiple threads avoid changing it all the time and interfering with each other?

https://help.talend.com//pages/viewpage.action?pageId=265114338

Answer 1

Here are a few approaches that I am using successfully for parallel executions:

If possible, create another job that you run in parallel; that helps in understanding the tasks:

In this example I use the "Use or Register a shared Db connection" feature, so I re-use my connections. The tJavaFlex just contains a simple try { } catch block, so I handle / hide the errors. GpOutput uses a connection that was created outside of the threads.

I prefer this approach, where I create a separate job and use context parameters to pass information to it.
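As a rough plain-Java analogy (not Talend code), the idea is that each unit of work receives everything it needs as parameters and keeps its document in local variables, so nothing is shared between threads. The class and method names and the placeholder order lookup are hypothetical, and a plain Map stands in for the MongoDB document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class IsolatedWorkers {
    // Stand-in for a child job: input arrives as parameters (like context
    // parameters) and the document lives only in local variables.
    static Map<String, Object> buildProduct(int productId, List<String> orders) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("product_id", productId);
        doc.put("product_orders", new ArrayList<>(orders));
        return doc;
    }

    // Run one task per product on a fixed pool; results come back in input order.
    static List<Map<String, Object>> runAll(List<Integer> productIds, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Map<String, Object>>> futures = new ArrayList<>();
            for (int id : productIds) {
                // Placeholder for the per-product order query.
                Callable<Map<String, Object>> task =
                        () -> buildProduct(id, Arrays.asList("order-for-" + id));
                futures.add(pool.submit(task));
            }
            List<Map<String, Object>> docs = new ArrayList<>();
            for (Future<Map<String, Object>> f : futures) {
                docs.add(f.get());
            }
            return docs;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(Arrays.asList(1, 2, 3), 2));
    }
}
```

Because no task writes to a shared map, there is nothing for the threads to overwrite, which is the main reason I like the separate-job style.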

Here you can see how the globalMap is used.

I found that the tPartition/Departitioner in Talend is very hard to use. I prefer more controlled ways of handling parallel executions, such as a loop that splits the workload across 20 parallel threads:

  " WHERE mod(num,20) = " + context.i
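As a small illustration of that split, the sketch below builds the WHERE fragment for each of the 20 iterations and applies the same modulo rule in Java to show which worker a given row would land on. The class and method names are hypothetical, and `num` is assumed to be a non-negative numeric column.

```java
import java.util.ArrayList;
import java.util.List;

public class ModPartition {
    // Build the WHERE fragment for worker i, mirroring the expression above.
    static String whereClause(int i, int workers) {
        return " WHERE mod(num," + workers + ") = " + i;
    }

    // The same rule in Java: which worker handles a given num (non-negative assumed).
    static int workerFor(long num, int workers) {
        return (int) (num % workers);
    }

    public static void main(String[] args) {
        int workers = 20;
        List<String> clauses = new ArrayList<>();
        // One clause per loop iteration; each iteration would start one parallel run.
        for (int i = 0; i < workers; i++) {
            clauses.add(whereClause(i, workers));
        }
        System.out.println(clauses.get(3));
        System.out.println(workerFor(43, workers));
    }
}
```

Every row matches exactly one of the 20 clauses, so the parallel runs never process the same row twice.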

Answer 1
0 2016-01-15 12:01:27