
Pentaho: Import unique records into database

I am quite new to Pentaho Spoon and I would like to import the records of a CSV file into a database table. However, only unique records should be imported. That is why I need to compare EACH record with all records of the database table in order to determine whether it should be imported or not.

So far, I have tried the suggested CRUD pattern, which looks like this: [image: enter image description here]

As you can see in the picture, I merge the Excel input and the table input. (Ignore the cast steps: I needed to cast a value because the float formats differed. The database format was #.000000 and the CSV format was #.0.)

After the merge join, I check the flag produced by Merge rows (diff): if the compared record is new, I insert it into the database table; if it is changed, I update the record; and if it is deleted or identical, I simply do nothing. So far, so good.
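The routing described above can be sketched in a few lines of Python. This is only an illustration of the logic, not Kettle code: `route_by_flag` and the row layout are hypothetical, and the flag values are assumed to be the strings "new", "changed", "deleted" and "identical".

```python
def route_by_flag(flagged_rows):
    """Split (row, flag) pairs into the rows to insert and the rows to update."""
    actions = {"insert": [], "update": []}
    for row, flag in flagged_rows:
        if flag == "new":
            actions["insert"].append(row)
        elif flag == "changed":
            actions["update"].append(row)
        # "deleted" and "identical" rows are intentionally ignored
    return actions


# Hypothetical sample data, just to exercise the function.
sample = [
    ({"id": 1, "value": 1.0}, "identical"),
    ({"id": 2, "value": 2.5}, "changed"),
    ({"id": 4, "value": 4.0}, "new"),
    ({"id": 3, "value": 3.0}, "deleted"),
]
result = route_by_flag(sample)
```

With this sample, only the row with id 4 is inserted and only the row with id 2 is updated.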

But here is the problem: if I shuffle the records of the CSV input file and run the transformation again, all the records are imported again, and consequently there are duplicates in my database table (which I wanted to avoid). To emphasize again: the right way to solve this is for each row of the CSV input file to be compared with ALL entries in the database table.

How can I achieve this? Any suggestions? Thank you so much in advance!

The Merge Rows (diff) step expects its inputs to be sorted. Normally, a pop-up warns you about this.
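To see why the sort order matters, here is a simplified Python model of a single-pass diff over two streams. It is a sketch under the assumption that the step walks both inputs in parallel and compares keys. It is not the actual Kettle implementation, and `merge_rows_diff`, `count_new` and the sample rows are made up for illustration.

```python
def merge_rows_diff(reference, compare, key):
    """Single-pass diff of two row lists that are BOTH assumed sorted by `key`."""
    flagged = []
    i = j = 0
    while i < len(reference) and j < len(compare):
        ref_row, cmp_row = reference[i], compare[j]
        if ref_row[key] == cmp_row[key]:
            flag = "identical" if ref_row == cmp_row else "changed"
            flagged.append((cmp_row, flag))
            i += 1
            j += 1
        elif ref_row[key] < cmp_row[key]:
            # The reference key is smaller, so it is assumed missing from `compare`.
            flagged.append((ref_row, "deleted"))
            i += 1
        else:
            # The compare key is smaller, so it is assumed missing from `reference`.
            flagged.append((cmp_row, "new"))
            j += 1
    # Whatever is left over on either side has no partner.
    flagged += [(row, "deleted") for row in reference[i:]]
    flagged += [(row, "new") for row in compare[j:]]
    return flagged


def count_new(flagged):
    return sum(1 for _, flag in flagged if flag == "new")


table_rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
shuffled_csv = [{"id": 3, "v": "c"}, {"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
sorted_csv = sorted(shuffled_csv, key=lambda row: row["id"])

new_when_shuffled = count_new(merge_rows_diff(table_rows, shuffled_csv, "id"))
new_when_sorted = count_new(merge_rows_diff(table_rows, sorted_csv, "id"))
```

In this model the shuffled input gets rows 1 and 2 flagged as "new" even though they already exist in the table, which matches the duplicates described in the question. Once the input is sorted, nothing is flagged as "new".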

Put a Sort rows step on the output flow of the Excel Input, before it reaches the Merge Rows (diff).

You should do the same between the Table Input and the Merge Rows (diff). Of course, you might think you could do the sorting in the SQL statement of the Table Input instead.

However, there is a beginner's trap here. You have three other steps (Output Rows, Update and Delete) that operate on the same table, and these steps may lock the table. Because all steps in Kettle run concurrently, you do not know which step will fire first, so the table may be locked and the Table Input may never be able to read even the first record. This is known in jargon as an auto-lock, and the way to solve it is to put in a Sort rows step as a buffer.

You can use the 'Dimension lookup/update' step, which provides the functionality you are trying to achieve.

Thanks, Nilesh
