Find first non-null value scanning rows backwards from current row
I'm using Pentaho Data Integration 5.4.
Input data from a file (simplified; there are more columns):
number  name
1009    ProductA
2150    ProductB
3235    ProductC
        ProductD
        ProductE
1234    ProductF
7765    ProductG
4566    ProductH
        ProductI
9907    ProductJ
The issue is that I had an Excel file (format xlsx) containing data with merged cells, and for one value of id there are 1..n rows of values.
After converting that file to csv, the values in the following rows (all but the first) are missing, except in the one column that was not merged (see example id=3, id=6).
I'm generating a sequence using the Add sequence step; the input is sorted the way it was originally stored in the file.
Basically what I need to do is:
1. Find the first non-null value with sequence_number less than current_row.sequence_number
2. Concatenate the value from field name to that matching row
3. Continue scanning rows with sequence_number higher than the last scanned
As stated before, there can be 1..n rows of values for such a case.
Expected output:
number name
1009 ProductA
2150 ProductB
3235 ProductC; ProductD; ProductE
1234 ProductF
7765 ProductG
4566 ProductH; ProductI
9907 ProductJ
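The three steps above can also be read as a single forward pass that remembers the last row with a number, instead of a backward scan per row. The sketch below is plain Python, not Kettle code; `merge_rows` and the in-memory `rows` list are hypothetical names used only for illustration, and it assumes the first row always has a number.

```python
def merge_rows(rows, sep="; "):
    """Attach every row with an empty number to the most recent row
    that had one, keeping the original file order."""
    merged = []  # list of [number, [names]] in input order
    for number, name in rows:
        if number:
            # A non-empty number starts a new output row.
            merged.append([number, [name]])
        elif merged:
            # Gap left by a merged cell: append to the last seen number.
            merged[-1][1].append(name)
        else:
            raise ValueError("first row has no number to inherit")
    return [(number, sep.join(names)) for number, names in merged]


# Sample input from the question; "" marks a missing number.
rows = [
    ("1009", "ProductA"),
    ("2150", "ProductB"),
    ("3235", "ProductC"),
    ("", "ProductD"),
    ("", "ProductE"),
    ("1234", "ProductF"),
    ("7765", "ProductG"),
    ("4566", "ProductH"),
    ("", "ProductI"),
    ("9907", "ProductJ"),
]

for number, names in merge_rows(rows):
    print(number, names)
```

Because each row is visited once, the cost stays linear in the number of rows no matter how many rows a merged cell spans.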
I believe I could do this in a loop: use Analytic Query to calculate LAG(1), concatenate the name column of a row with null values into the previous row, discard the other column values of the null row, and then repeat the whole thing (say 20 times, assuming that is the maximum). I do consider this a bad idea, though.
There are probably better ways to achieve this result, for example a Java Script step that scans the rows backward from the current one (based on the sequence number), but I'm not aware of such functions, if they exist.
How can I achieve this using the Modified Java Script Value step, or any other efficient way that does not loop over the entire content of the file until there are no empty rows left?
Merged adjacent cells in Excel files are presented on the image below.
When the file is opened as plain text, every row covered by a merged cell except the first one has a gap (the data from the merged cell is missing).
number  name
1000/P  um6p1
        um1p2
        um1p3
1500    um2p1
9823    um3p1
83424   um4p1
        um4p2
        um4p3
        um4p4
21390   um5p1
While @bolav's answer addresses the problem, there is a simpler and probably more efficient approach to this issue in Kettle.
1. In the Microsoft Excel Input step, go to the Fields tab and mark the Repeat option as Y for the columns that store values in merged cells
2. Use Sort rows on the number column, because the Group by step needs the input to be sorted
3. Use Group by on the field number and aggregate name with Concatenate strings separated by as the type and ; as the value
From the Pentaho User Guide:
Repeat: If set to Y, will repeat this value if the field in the next row is empty.
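For readers who want to check the logic outside Kettle, the same three stages (repeat, sort, group and concatenate) can be mimicked in a few lines of Python. This is only an analogy under stated assumptions, not Kettle's implementation: `repeat_fill` and `group_concat` are hypothetical helpers, numbers are compared as strings, and the first row is assumed to have a number.

```python
from itertools import groupby
from operator import itemgetter


def repeat_fill(rows):
    """Mimic Repeat = Y: an empty number takes the value from the row above."""
    last = None
    filled = []
    for number, name in rows:
        if number:
            last = number
        if last is None:
            raise ValueError("first row has no number to repeat")
        filled.append((last, name))
    return filled


def group_concat(rows, sep=";"):
    """Mimic Sort rows + Group by with string concatenation.
    itertools.groupby only groups consecutive equal keys, so the input
    is sorted first; sorted() is stable, so names keep their file order."""
    ordered = sorted(rows, key=itemgetter(0))  # string comparison
    return [(number, sep.join(name for _, name in group))
            for number, group in groupby(ordered, key=itemgetter(0))]


# Sample data from the gap example; "" marks a missing number.
rows = [
    ("1000/P", "um6p1"), ("", "um1p2"), ("", "um1p3"),
    ("1500", "um2p1"), ("9823", "um3p1"),
    ("83424", "um4p1"), ("", "um4p2"), ("", "um4p3"), ("", "um4p4"),
    ("21390", "um5p1"),
]

for number, names in group_concat(repeat_fill(rows)):
    print(number, names)
```

Note that sorting reorders the groups by number, so the output order differs from the file order; if the original order matters, an extra sort on the sequence field would be needed afterwards.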