
In Pentaho Kettle, how can I check whether a filename already exists?

I am new to Pentaho Kettle.

For now, I have a folder containing many .txt files.

For example: 20121012.txt, 20121014.txt, ...

Every time I run the Kettle job, it picks up all of these files and imports them into the database.

I need to add a check before the import to prevent duplicate data.

The problem is: how can I make Kettle recognize filenames that have already been imported?

For example:

20121012.txt <= once this file has been imported, its filename should be checked on the next run; if the filename is the same, the file should not be imported again.

In this case, I cannot simply set a specific file such as "20121012.txt" in the step "Check if files exists", because there are so many txt files. Each filename refers to a day, and a year contains 365 or 366 days, so I cannot hard-code every day's file this way.

So a possible way is to check, before importing into the database, whether the filename of the file being processed has already been imported.

My question is: how can I do this? Which steps or workflow do I need to use? Could anyone describe the steps in detail?

I am looking forward to hearing from you and please let me know if you need more information.

Thanks all for helping!

You can do this by storing the list of already processed files somewhere, such as a table in the database. Load that table in a separate step, then join the two streams with a merge and pass through only those files from the file load step that are not in the other stream.

Make sure to update your already-processed table with any newly processed files afterwards.
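To make the logic concrete outside of Kettle, here is a minimal Python sketch of the same idea, not a Kettle transformation. It assumes a hypothetical table named `processed_files` with a single `filename` column, and uses an in-memory SQLite database only so the example runs on its own. The helper names `unprocessed` and `mark_processed` are made up for illustration.

```python
import sqlite3


def unprocessed(conn, candidates):
    """Return the candidate file names that are not yet in processed_files."""
    done = {row[0] for row in conn.execute("SELECT filename FROM processed_files")}
    # Keep only the names with no match in the "already processed" stream.
    return [name for name in candidates if name not in done]


def mark_processed(conn, names):
    """Record file names as processed; names already recorded are left alone."""
    conn.executemany(
        "INSERT OR IGNORE INTO processed_files (filename) VALUES (?)",
        [(name,) for name in names],
    )
    conn.commit()


# Hypothetical setup: one file was imported on an earlier run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_files (filename TEXT PRIMARY KEY)")
mark_processed(conn, ["20121012.txt"])

found = ["20121012.txt", "20121014.txt"]  # what the folder scan returned
todo = unprocessed(conn, found)
print(todo)  # ['20121014.txt']

# After a successful import, remember the new files for the next run.
mark_processed(conn, todo)
```

In Kettle the two helper functions would correspond roughly to the join-and-filter part of the transformation and to a final table output step.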

You can use the "Get File Names" step. In this step, set the folder(s) that store your files, and then set the wildcard (for example ".*" if you want all files from the folder).
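As far as I know, the wildcard field in that step is treated as a regular expression rather than a shell glob, which is why ".*" (and not "*") matches everything. For file names like the ones in the question, a narrower pattern such as `\d{8}\.txt` might be a better fit. The Python sketch below only illustrates what such a pattern would select; the pattern and the sample names are assumptions, and it is worth checking the behavior in your Kettle version before relying on it.

```python
import re

# Assumed pattern: exactly eight digits followed by ".txt".
DAILY_FILE = re.compile(r"\d{8}\.txt")


def select_files(names):
    """Keep only the names that match the pattern in full."""
    return [name for name in names if DAILY_FILE.fullmatch(name)]


names = ["20121012.txt", "20121014.txt", "notes.txt", "20121012.txt.bak"]
print(select_files(names))  # ['20121012.txt', '20121014.txt']
```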

If your database stores the filenames that were already imported, you can make your transformation idempotent by using "Database Lookup" to check whether each filename is already in the database, and then filtering the stream so that only filenames that weren't found in the database pass through.
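Here is a rough Python sketch of that row-by-row lookup, mainly to show what "idempotent" means here: running the same input twice imports nothing the second time. The table name `processed_files` and the functions `lookup` and `run_once` are hypothetical, and SQLite stands in for whatever database you actually use.

```python
import sqlite3


def lookup(conn, filename):
    """Per-row lookup: return the stored name, or None when there is no match."""
    row = conn.execute(
        "SELECT filename FROM processed_files WHERE filename = ?", (filename,)
    ).fetchone()
    return row[0] if row else None


def run_once(conn, filenames):
    """Import only the files whose lookup came back empty; return their names."""
    imported = []
    for name in filenames:
        if lookup(conn, name) is None:  # the filter: lookup field is null
            # The real import of the file contents would happen here.
            conn.execute(
                "INSERT INTO processed_files (filename) VALUES (?)", (name,)
            )
            imported.append(name)
    conn.commit()
    return imported


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_files (filename TEXT PRIMARY KEY)")

files = ["20121012.txt", "20121014.txt"]
first = run_once(conn, files)   # both files are new
second = run_once(conn, files)  # nothing left to import
print(first, second)  # ['20121012.txt', '20121014.txt'] []
```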


 