
Talend Parallelization and Java Scope

I am creating a job that builds a complex, multi-level document for MongoDB from relational data.

I read 'product' records in from Oracle.

I have a tJavaRow in which I use the MongoDB API to create a product document (BasicDBObject) from the incoming product details. I store this document in the globalMap (call it 'product_doc') because I need to embed a sub-document in it later in the subjob.

I use a tFlowToIterate to store the product_id in the globalMap.

I then have another Oracle input that uses the product_id from the globalMap as a parameter in its SQL, fetching the 'many' side of the relationship to products (call this 'product_orders').

I build a Java List of 'product_order' documents and write the List to the globalMap; let's call this 'product_orders'.

I then insert the 'product_orders' List as a sub-document into the 'product' document in a tJava component. I write 'product' to MongoDB and then move on to the next product row from Oracle.
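The embedding step described above can be sketched in plain Java. This is only an illustration: it uses `LinkedHashMap` and `ArrayList` as stand-ins for the MongoDB `BasicDBObject` and its list of sub-documents, and the class name, helper names, and field names other than 'product_orders' are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NestedDocSketch {
    // Builds one 'product_order' sub-document (field names are illustrative).
    static Map<String, Object> buildOrder(int orderId, int quantity) {
        Map<String, Object> order = new LinkedHashMap<>();
        order.put("order_id", orderId);
        order.put("quantity", quantity);
        return order;
    }

    // Builds the parent 'product' document and embeds the list of orders in it.
    static Map<String, Object> buildProduct(int productId, List<Map<String, Object>> orders) {
        Map<String, Object> product = new LinkedHashMap<>();
        product.put("product_id", productId);
        product.put("product_orders", orders); // the embedded sub-document list
        return product;
    }

    // Small end-to-end example: one product with two embedded orders.
    static Map<String, Object> example() {
        List<Map<String, Object>> orders = new ArrayList<>();
        orders.add(buildOrder(1, 5));
        orders.add(buildOrder(2, 3));
        return buildProduct(42, orders);
    }
}
```

In the real job the parent document and the list would live in the globalMap between components; here they are simply passed as arguments.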

It is more complex than this, creating a 5-level hierarchy, but this is the basic idea. The problem is that it takes 3 hours to run.

So I want to run the job parallelized, so that each product row from Oracle gets dispatched onto a new thread, round-robin style.

However, I depend heavily on the globalMap to store objects for later use in the flow, and I know the threads will trample all over each other. I assume every thread sees the same variable scope across the subjob.

I think I can identify the thread ID using a global variable in the globalMap, "tCollector_1_THREAD_ID".

So I had considered doing this when I add documents/objects to the globalMap:

globalMap.put("product_doc_" + globalMap.get("tCollector_1_THREAD_ID"), productDoc);

That way everything I put in the globalMap is thread-specific and tagged. But I don't know how tCollector_1_THREAD_ID gets populated. If it is in the globalMap, then surely each thread can trample over this value as well?

It didn't work. I was getting a load of null errors.

So I guess my question is about variable scope and the use of the globalMap in tJavaRow components within a parallelized data flow, when you need to maintain references in each thread.
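For reference, the key-tagging idea can be sketched in plain Java, outside Talend. The sketch assumes a single map shared by all threads (`sharedMap` is a hypothetical stand-in, not Talend's generated globalMap) and derives the tag from `Thread.currentThread().getId()` rather than from a value stored in the shared map itself, so the tag cannot be overwritten by another thread. Whether Talend's generated code shares one globalMap across parallel threads is not confirmed here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ThreadScopedKeys {
    // Hypothetical stand-in for a map shared by every worker thread.
    static final Map<String, Object> sharedMap = new ConcurrentHashMap<>();

    // Tags a key with the id of the calling thread.
    static String scopedKey(String name) {
        return name + "_" + Thread.currentThread().getId();
    }

    // Stores a value under the caller's own key and checks it is still there afterwards.
    static boolean storeAndReadBack(int productId) {
        String key = scopedKey("product_doc");
        String value = "doc-for-product-" + productId;
        sharedMap.put(key, value);
        Thread.yield(); // give other threads a chance to interleave
        return value.equals(sharedMap.get(key));
    }

    // Processes `rows` rows on `threads` workers; returns true if no value was overwritten.
    static boolean runWorkers(int threads, int rows) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicBoolean allOk = new AtomicBoolean(true);
        for (int i = 0; i < rows; i++) {
            final int id = i;
            pool.submit(() -> {
                if (!storeAndReadBack(id)) {
                    allOk.set(false);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return allOk.get();
    }
}
```

Because each worker thread handles its rows one at a time, a key tagged with that thread's id is only ever written by that thread.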

---- UPDATE ------

For clarity: the page linked below states that you can get the thread ID from the variable tCollector_1_THREAD_ID. But it gets that variable from the globalMap.

Surely the globalMap is a global variable, so how can multiple threads avoid changing it all the time and interfering with each other?

https://help.talend.com//pages/viewpage.action?pageId=265114338
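As a general Java illustration of how per-thread state can avoid this kind of interference (not a description of what Talend generates), `ThreadLocal` gives every thread its own private copy of a variable. In the sketch below each thread gets its own map, so the same key 'product_doc' can be used everywhere without tagging; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerThreadScope {
    // Every thread that calls threadMap.get() receives its own private HashMap.
    static final ThreadLocal<Map<String, Object>> threadMap =
            ThreadLocal.withInitial(HashMap::new);

    // Starts `threads` workers that repeatedly write and re-read the same key.
    // Returns how many times a worker found its value replaced by another thread.
    static int countClobbered(int threads, int iterations) {
        int[] clobbered = new int[threads]; // one slot per worker, no sharing between workers
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < iterations; i++) {
                    String mine = "doc-" + id + "-" + i;
                    threadMap.get().put("product_doc", mine);
                    Thread.yield(); // let other threads run in between
                    if (!mine.equals(threadMap.get().get("product_doc"))) {
                        clobbered[id]++;
                    }
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        int total = 0;
        for (int c : clobbered) {
            total += c;
        }
        return total;
    }
}
```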

Here are a few approaches that I use successfully for parallel executions:

If possible, create a separate job that you run in parallel; that makes the tasks easier to understand:

[Image: using the Iterate link]

In this example I use the "Use or Register a shared Db connection" feature, so I reuse my connections. The tJavaFlex just contains a simple try { } catch block, so I can handle/hide the errors. GpOutput uses a connection that was created outside of the threads.


I prefer the approach where I create a separate job and use context parameters to pass information to it.

Here you can see how the globalMap is used.

I found the tPartitioner/tDepartitioner components in Talend very hard to use. I prefer more controlled ways to handle parallel execution, such as a loop that splits the workload across 20 parallel threads:

  " WHERE mod(num,20) = " + context.i
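The modulo split can be sketched in plain Java as follows. This is a rough illustration under stated assumptions, not the actual Talend job: the parameter `i` plays the role of `context.i`, the values of `num` are assumed to be non-negative integers, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ModPartition {
    // Builds the filter one worker appends to its query; i stands in for context.i.
    static String whereClause(int i, int workers) {
        return " WHERE mod(num," + workers + ") = " + i;
    }

    // In-memory equivalent of that filter (assumes non-negative values).
    static List<Integer> slice(List<Integer> nums, int i, int workers) {
        List<Integer> out = new ArrayList<>();
        for (int n : nums) {
            if (n % workers == i) {
                out.add(n);
            }
        }
        return out;
    }

    // Runs one task per slice in parallel and returns the total number of rows handled.
    static int processAll(List<Integer> nums, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger handled = new AtomicInteger();
        for (int i = 0; i < workers; i++) {
            final int part = i;
            pool.submit(() -> {
                for (int n : slice(nums, part, workers)) {
                    handled.incrementAndGet(); // placeholder for the real per-row work
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return handled.get();
    }
}
```

Because every value of `num` falls into exactly one remainder class, each row is handled by exactly one worker and no two workers touch the same row.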
