
Processing a single group of rows at once in Talend Open Studio for Data Integration

I have a data source where each row has five fields:

company name; year; code; value;

In my target output row model I want to produce a row like so:

company name;year;value1;value2;value3;value4

Where value1,value..N are not a concatenation for a single code but rather a "mapping", i.e. code 50 => "Total Revenues".

So I need to perform the following logic:

  1. First select all the records for the same company name / year.
  2. Then apply some custom Java logic that performs the mapping between my codes and the fields of my output row.

This is an in-memory map-reduce over about 1M rows. How should this be handled in Talend Open Studio for Data Integration?

"select all the records for the same company name / year"

You might want to use tAggregateRow ( https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/18.1+tAggregateRow ) to group the flow by company name and year.

"apply some custom java logic that performs the mapping between my codes and my fields of the output row."

Talend has a component called tMap that allows you to map input fields into output fields.

In your tMap you can use something like the following (assuming that input is the name of the flow into your tMap and output is the name of the flow out of your tMap):

In output.field1 put input.code == 50 ? input.value : 0

In output.field2 put input.code == 60 ? input.value : 0

In output.field3 put input.code == 70 ? input.value : 0

etc.

This assumes you are OK with leaving a field column at 0 when the row's value belongs to another code.
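As a rough illustration of what those tMap expressions compute, here is a minimal plain-Java sketch. The InputRow class, the valueFor helper and the sample data are hypothetical stand-ins for the row struct that Talend generates, and the column types (Integer code, Double value) are assumptions. If code is a String in your schema, the comparison would need equals instead of ==.

```java
public class TMapGuardSketch {
    // Hypothetical stand-in for the row struct Talend generates for the "input" flow.
    static class InputRow {
        String companyName;
        int year;
        Integer code;   // assumed nullable Integer column
        Double value;   // assumed nullable Double column

        InputRow(String companyName, int year, Integer code, Double value) {
            this.companyName = companyName;
            this.year = year;
            this.code = code;
            this.value = value;
        }
    }

    // Same idea as `input.code == 50 ? input.value : 0`, generalized to any target code.
    // The null checks are an addition so that empty columns do not cause a NullPointerException.
    static double valueFor(InputRow input, int targetCode) {
        boolean matches = input.code != null && input.code == targetCode && input.value != null;
        return matches ? input.value : 0.0;
    }

    public static void main(String[] args) {
        InputRow input = new InputRow("Acme", 2015, 50, 1200.0);
        System.out.println(valueFor(input, 50)); // prints 1200.0
        System.out.println(valueFor(input, 60)); // prints 0.0
    }
}
```

Each output column in the tMap would carry one such expression with its own code. Note that this still yields one output row per input row; the grouping by company name and year has to happen in a later step.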

If you want the value for each code to be in a different output row out of the tMap, you can use logic similar to the above, but put each test ( input.code == 70 ? input.value : 0 ) in a different output table, and then filter out the rows that have 0 (using a tFilterRow) after the tMap.

To add output tables you can use the + symbol on the top right of the tMap.

See here for more details on how to use tMap: https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide54EN/tMap

I hope this helps!

You could do it like this; it is essentially the approach Maira Bay already suggested:

  1. Set up your data source to emit those lines one at a time. I used tFixedFlowInput for that. You'd probably have to read from a file.
  2. Optionally sort by company name and year with a tSortRow.
  3. With a tMap, map the value of each line to the corresponding column in the result line, using a guard clause like input.code.equals("code for this column") ? input.value : null.
  4. Aggregate the rows with a tAggregateRow, grouping by company name and year and selecting the first value for each of the value columns, but make sure to ignore the nulls.
  5. Do anything you want with the resulting lines.

I tried that with some sample data, hence the tFixedFlowInput in step 1, and it worked for me in TOS 6.3.1.

Beware: the proposed solution assumes there is only one value per combination of company name, year and code.
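Steps 3 and 4 can also be sketched in plain Java outside Talend, which may help when checking the expected output of the job. This is only an illustration under stated assumptions: the CODES list, the String[] row layout (company, year, code, value) and the pivot method are hypothetical, and the code keeps the first non-null value per column, in the spirit of the tAggregateRow configuration described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PivotSketch {
    // Hypothetical codes; position i here feeds output column value(i+1).
    static final String[] CODES = {"50", "60", "70"};

    // Input rows: {company, year, code, value}.
    // Output rows: {company, year, value1, value2, value3}, one per company/year.
    static List<String[]> pivot(List<String[]> rows) {
        Map<List<String>, String[]> groups = new LinkedHashMap<>();
        for (String[] row : rows) {
            List<String> key = Arrays.asList(row[0], row[1]);
            String[] out = groups.get(key);
            if (out == null) {
                out = new String[2 + CODES.length];
                out[0] = row[0];
                out[1] = row[1];
                groups.put(key, out);
            }
            for (int i = 0; i < CODES.length; i++) {
                // Guard clause (step 3) plus "first non-null wins" (step 4).
                if (CODES[i].equals(row[2]) && out[2 + i] == null) {
                    out[2 + i] = row[3];
                }
            }
        }
        return new ArrayList<>(groups.values());
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Acme", "2015", "50", "100"});
        rows.add(new String[] {"Acme", "2015", "70", "300"});
        rows.add(new String[] {"Beta", "2016", "60", "20"});
        for (String[] out : pivot(rows)) {
            System.out.println(Arrays.toString(out));
        }
    }
}
```

If a company/year/code combination occurs more than once, this sketch silently keeps the first value and drops the rest, which is the same caveat noted above.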

See the solution below, which I believe will fulfill your precise requirement of taking a delimited file data source and transforming it into a denormalized output as specified above.

First I mocked up a file with the same format as you specified. I made the values a logical concatenation of Company, Year, and sequence. This makes it easy to verify the output.

[image]

Next I use that as an input, run it through a sorter, then denormalize on the value field. Finally you can see the output in a tLogRow.

[image]

I also included the component view of tDenormalize so you can see how that is done. You can use this technique in any flavor of Talend Open Studio.
[image]
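For readers without access to the screenshots, the sort-then-denormalize flow can be approximated in plain Java. This is a hedged sketch, not the job itself: the denormalize method, the String[] row layout (company, year, code, value) and the ";" delimiter are assumptions, and it simply joins the value column for all rows that share a company name and year.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DenormalizeSketch {
    // Input rows: {company, year, code, value}.
    // Output lines: company, year, then every value of that group, joined by the delimiter.
    static List<String> denormalize(List<String[]> rows, String delimiter) {
        // Sorter step: order by company, year, then code (compared as strings here).
        List<String[]> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparing((String[] r) -> r[0])
                .thenComparing(r -> r[1])
                .thenComparing(r -> r[2]));

        // Denormalize step: collect the value column per company/year key.
        Map<List<String>, List<String>> groups = new LinkedHashMap<>();
        for (String[] r : sorted) {
            groups.computeIfAbsent(Arrays.asList(r[0], r[1]), k -> new ArrayList<>()).add(r[3]);
        }

        List<String> lines = new ArrayList<>();
        for (Map.Entry<List<String>, List<String>> e : groups.entrySet()) {
            lines.add(String.join(delimiter, e.getKey()) + delimiter
                    + String.join(delimiter, e.getValue()));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Beta", "2016", "60", "B2016-1"});
        rows.add(new String[] {"Acme", "2015", "70", "A2015-2"});
        rows.add(new String[] {"Acme", "2015", "50", "A2015-1"});
        for (String line : denormalize(rows, ";")) {
            System.out.println(line);
        }
    }
}
```

In this sketch the values appear in sorted code order rather than at fixed per-code positions, so a group that lacks one code ends up with fewer columns. If fixed columns per code are required, a guard-clause mapping is likely still needed.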
