Find first non-null value scanning rows backwards from current row

Software

I'm using Pentaho Data Integration 5.4.

Input data & explanation

Input data from a file (simplified; there are more columns):

number       name
1009       ProductA
2150       ProductB
3235       ProductC
           ProductD
           ProductE
1234       ProductF
7765       ProductG
4566       ProductH
           ProductI
9907       ProductJ

The issue is that I had an Excel file in xlsx format whose data contains merged cells, so for one value of id there are 1..n rows of values.

After converting that file to CSV, the values in the following rows (every row other than the first) are missing, except for the one column that was not merged (see example id=3, id=6).

I'm generating a sequence using the Add sequence step; the input is sorted the way it was originally stored in the file.

Steps to achieve the goal

Basically what I need to do is:

  • Find the first non-null value that has a sequence_number less than current_row.sequence_number
  • Concatenate the value from field name onto that matching row
  • Keep scanning the next rows with a sequence_number higher than the last one scanned

As stated before, there can be 1..n rows of values in such a case.
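
The three steps above amount to one forward pass over the rows in file order, so no backward scan is strictly needed. The following is a minimal sketch in plain JavaScript, outside of Kettle. The helper name `mergeIntoPrevious` and the row shape `{ number, name }` are hypothetical and chosen only for illustration.

```javascript
// Hypothetical helper, not part of Kettle: rows are plain objects in file
// order, and number is null or "" where the merged cell left a gap.
function mergeIntoPrevious(rows) {
  const result = [];
  for (const row of rows) {
    const hasNumber = row.number !== null && row.number !== "";
    if (hasNumber || result.length === 0) {
      // A row with its own number (or the very first row) starts a new output row.
      result.push({ number: row.number, name: row.name });
    } else {
      // A gap row: append its name to the most recent output row.
      result[result.length - 1].name += "; " + row.name;
    }
  }
  return result;
}

const input = [
  { number: "3235", name: "ProductC" },
  { number: null, name: "ProductD" },
  { number: null, name: "ProductE" },
  { number: "1234", name: "ProductF" },
];
const merged = mergeIntoPrevious(input);
console.log(merged);
```

Because the most recent output row is always the nearest preceding row with a non-null number, the pass stays linear in the number of rows and needs no fixed maximum for n.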

Expected output

number       name
1009       ProductA
2150       ProductB
3235       ProductC; ProductD; ProductE
1234       ProductF
7765       ProductG
4566       ProductH; ProductI
9907       ProductJ

My approach

I believe I could do this in a loop: use Analytic Query to calculate LAG(1), concatenate the name column of a row that has null values onto the preceding row, discard the other column values of the null row, and then repeat this (say 20 times, assuming that is the maximum). I do consider this a bad idea, though.

There are probably better ways to achieve this result, for example a Java Script step that scans the rows backwards from the current one (based on the sequence number), but I'm unaware of such functions, if they exist.

How can I achieve this using the Modified Java Script Value step, or any other efficient way, without looping over the entire content of the file until there are no empty rows left?

To solve this, I would use Modified Java Script Value to save the last seen number and use it for all rows that follow until a new number appears, and then use Group By to combine the rows.
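
As a rough illustration of that idea, here is a plain JavaScript model of the two stages. It is not the actual Kettle script: `processRow`, `groupConsecutive`, and the row shape are hypothetical names, and how a variable is kept alive between rows inside the Modified Java Script Value step is an assumption to check against your Kettle version.

```javascript
// Plain JavaScript model of the idea, not actual Kettle step code.
let lastNumber = null; // state that has to survive from one row to the next

// Stage 1: called once per row, fills a missing number with the last seen one.
function processRow(row) {
  if (row.number === null || row.number === "") {
    return { number: lastNumber, name: row.name };
  }
  lastNumber = row.number;
  return { number: row.number, name: row.name };
}

// Stage 2: model of grouping, merging only consecutive rows with the same number.
function groupConsecutive(rows) {
  const groups = [];
  for (const row of rows) {
    const last = groups[groups.length - 1];
    if (last !== undefined && last.number === row.number) {
      last.names.push(row.name);
    } else {
      groups.push({ number: row.number, names: [row.name] });
    }
  }
  return groups.map((g) => ({ number: g.number, name: g.names.join("; ") }));
}

const filled = [
  { number: "4566", name: "ProductH" },
  { number: "", name: "ProductI" },
  { number: "9907", name: "ProductJ" },
].map((row) => processRow(row));
const grouped = groupConsecutive(filled);
console.log(grouped);
```

After the fill stage, rows that belong together are already adjacent, which is why this model groups consecutive rows only and keeps the original file order.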

[Screenshots omitted: Transformation, Using Javascript, Group by step, Output]

Introduction

Merged adjacent cells in an Excel file are shown in the image below.

[Image: Excel file with merged cells]

When the file is opened as plain text, every row that contains the merged cell, except the first, has a gap where the data from the merged cell is missing.

number  name
1000/P  um6p1
        um1p2
        um1p3
1500    um2p1
9823    um3p1
83424   um4p1
        um4p2
        um4p3
        um4p4
21390   um5p1

While @bolav's answer addresses the problem, there is a simpler and probably more efficient approach to this issue in Kettle.


Approach

  • In the Microsoft Excel Input step, go to the Fields tab and set the Repeat option to Y for columns that store values in merged cells
  • Use Sort rows on the number column, because the Group by step needs its input to be sorted
  • Group by on field number and aggregate name with type Concatenate strings separated by and ; as the value

From the Pentaho User Guide:

Repeat: If set to Y, will repeat this value if the field in the next row is empty.
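
The effect of the three steps can be modeled in plain JavaScript as below. This is only a sketch of the data flow, not Kettle code: `repeatSortGroup` is a hypothetical name, and comparing number as a string is an assumption (Kettle sorts according to the field's declared type).

```javascript
// Hypothetical model of Repeat = Y, Sort rows, and Group by; not Kettle code.
function repeatSortGroup(rows) {
  // Repeat = Y: an empty number takes the value from the row before it.
  let previous = null;
  const repeated = rows.map((row) => {
    if (row.number !== null && row.number !== "") previous = row.number;
    return { number: previous, name: row.name };
  });

  // Sort rows on number (string comparison, see the assumption above).
  const sorted = [...repeated].sort((a, b) =>
    a.number < b.number ? -1 : a.number > b.number ? 1 : 0
  );

  // Group by number, concatenating name with "; " as the separator.
  const byNumber = new Map();
  for (const row of sorted) {
    if (!byNumber.has(row.number)) byNumber.set(row.number, []);
    byNumber.get(row.number).push(row.name);
  }
  return [...byNumber].map(([number, names]) => ({
    number,
    name: names.join("; "),
  }));
}

const sample = [
  { number: "1000/P", name: "um6p1" },
  { number: "", name: "um1p2" },
  { number: "", name: "um1p3" },
  { number: "1500", name: "um2p1" },
  { number: "9823", name: "um3p1" },
  { number: "83424", name: "um4p1" },
  { number: "", name: "um4p2" },
];
const result = repeatSortGroup(sample);
console.log(result);
```

Note that sorting reorders the output by number, so the rows no longer appear in file order. If the original order matters, a sequence field would have to be carried through and sorted on afterwards.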

[Screenshots omitted: Transformation, Group by step, Execution results]
