
Pentaho: import unique records into a database

Original post: 2017-11-22 08:16:55. Tags: mysql, database, pentaho, pentaho-spoon, pentaho-data-integration

I am quite new to Pentaho Spoon and would like to import the records of a CSV file into a database table. However, only unique records should be imported. That is why I need to compare each record with all records in the database table to determine whether it should be imported.

So far, I have tried the suggested CRUD pattern, which looks like this:

[Screenshot of the transformation]

As the picture shows, I merge the Excel input and the table input. (Ignore the cast steps: I needed to cast one value because the float formats differed. The database format was #.000000 and the CSV format was #.0.)

After the merge, I check the flag set by Merge Rows (diff). If a record is new, I insert it into the database table; if it is changed, I update it; if it is deleted or identical, I do nothing. So far, so good.

But here is the problem: if I shuffle the records of the CSV input file and run the transformation again, all the records are imported again, and consequently there are duplicates in my database table, which is what I wanted to avoid. To emphasize again: the right way to solve this is to compare each row of the CSV input file with all entries in the database table.

How can I achieve this? Any suggestions? Thank you so much in advance!

2 solutions

The Merge Rows (diff) step expects its input to be sorted. Normally, a pop-up warns you about this.

Put a Sort rows step on the output of the Excel Input, before it reaches Merge Rows (diff).

Do the same between the Table Input and Merge Rows (diff). Of course, you may think you could do the sorting in the SQL statement of the Table Input instead.

However, there is a beginner trap here. Three other steps, Output Rows, Update and Delete, operate on the same table, and these steps may lock it. Because all steps in Kettle run concurrently, you do not know which step will fire first, so the table may be locked before the Table Input has read even the first record. This is known in jargon as an auto-lock, and the way to solve it is to use a Sort rows step as a buffer.
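To see why a shuffled file leads to duplicates, here is a minimal Python sketch of a single-pass merge over two key-sorted streams, which is the general technique a sorted diff relies on. It is not Kettle's actual implementation: the function name `merge_rows_diff`, the two-column row layout and the sample data are assumptions made for illustration. Only the four flag values are taken from the question.

```python
from typing import Iterator, List, Tuple

Row = Tuple[str, float]  # (key, value); hypothetical two-column layout


def merge_rows_diff(reference: List[Row], compare: List[Row]) -> Iterator[Tuple[str, str]]:
    """Single-pass diff that ASSUMES both lists are sorted by key.

    Yields (key, flag) with flag in: identical, changed, new, deleted.
    """
    i = j = 0
    while i < len(reference) and j < len(compare):
        ref_key, ref_val = reference[i]
        cmp_key, cmp_val = compare[j]
        if ref_key == cmp_key:
            yield ref_key, "identical" if ref_val == cmp_val else "changed"
            i += 1
            j += 1
        elif ref_key < cmp_key:
            yield ref_key, "deleted"  # key seen only in the reference (table) stream
            i += 1
        else:
            yield cmp_key, "new"      # key seen only in the compare (file) stream
            j += 1
    # Whatever is left over on either side has no partner.
    for key, _ in reference[i:]:
        yield key, "deleted"
    for key, _ in compare[j:]:
        yield key, "new"


table_rows = [("a", 1.0), ("b", 2.0), ("c", 3.0)]     # already in the database
shuffled_file = [("c", 3.0), ("a", 1.0), ("b", 2.0)]  # same records, shuffled

# Sorting the file stream first gives the expected result.
sorted_flags = list(merge_rows_diff(table_rows, sorted(shuffled_file)))

# Skipping the sort breaks the merge: existing rows get flagged "new".
shuffled_flags = list(merge_rows_diff(table_rows, shuffled_file))

print(sorted_flags)
print(shuffled_flags)
```

With the sorted stream every record is flagged `identical`. With the shuffled stream, `a` and `b` are flagged `deleted` and then `new`, and inserting those `new` rows is what creates the duplicates.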

You can use the 'Dimension lookup/update' step, which provides the functionality you are trying to achieve.

Thanks, Nilesh
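As an illustration of the lookup-by-key idea, which does not depend on the order of the input, here is a small Python sketch that checks each incoming row against the table before inserting it. It is not how the Pentaho step is implemented. The `records` table, its `id` and `value` columns, the `import_unique` helper and the use of an in-memory SQLite database in place of MySQL are all assumptions made for this example.

```python
import sqlite3
from typing import Iterable, Tuple


def import_unique(conn: sqlite3.Connection, rows: Iterable[Tuple[str, float]]) -> int:
    """Insert only rows whose key is not yet in the table; return how many were inserted."""
    inserted = 0
    for key, value in rows:
        # Look the key up in the target table (hypothetical `records` table).
        found = conn.execute("SELECT 1 FROM records WHERE id = ?", (key,)).fetchone()
        if found is None:
            conn.execute("INSERT INTO records (id, value) VALUES (?, ?)", (key, value))
            inserted += 1
    conn.commit()
    return inserted


conn = sqlite3.connect(":memory:")  # stand-in for the real MySQL database
conn.execute("CREATE TABLE records (id TEXT PRIMARY KEY, value REAL)")

first_run = import_unique(conn, [("a", 1.0), ("b", 2.0), ("c", 3.0)])
second_run = import_unique(conn, [("c", 3.0), ("a", 1.0), ("b", 2.0)])  # same rows, shuffled
total = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]

print(first_run, second_run, total)
```

Because each row is looked up by its key, running the import a second time with the shuffled rows inserts nothing. A primary key or unique constraint on the table gives an additional safeguard against duplicates.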