
In Pentaho Kettle, how do I check whether a filename already exists?

I am new to Pentaho Kettle.

Right now I have a folder containing many .txt files.

For example: 20121012.txt, 20121014.txt, and so on.

Every time I run the Kettle job, it picks up all of these files and imports them into the database.

I need to add a check before the import into the database to prevent data duplication.

The problem is: how can I make Kettle aware of which filenames have already been imported?

For example:

20121012.txt <= if this file has already been imported, the next run should check its filename and, if the name is the same, skip the import.

In this case, I cannot simply set a specific file such as "20121012.txt" in the "Check if files exists" step, because there are a large number of .txt files. If each filename refers to a day, then one year contains 365 or 366 files. I cannot hard-code every day's file this way.

So the feasible approach is to check, before importing into the database, whether the filename of the file being processed has already been seen.

So my question is: how can I do this? Which steps or workflow do I need to use? Could anyone describe in detail the steps that would make this possible?

Please let me know if you need more information.

Thanks, everyone, for helping!

You can do this by storing the list of already processed files somewhere, such as a table in the database. Load that table in a separate step, then join the two streams with a merge and pass through only those files from the file-load step that are not in the other stream.
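To make the merge logic concrete, here is a minimal sketch in Python rather than Kettle itself. The function name `unprocessed` and the sample file names are only illustrative; inside Kettle the same effect would come from joining the two streams and keeping the rows with no match.

```python
def unprocessed(found_files, processed_files):
    """Return the names in found_files that are not in processed_files.

    found_files     -- names coming from the file-load stream
    processed_files -- names loaded from the "already processed" table
    The order of found_files is preserved.
    """
    already_done = set(processed_files)
    return [name for name in found_files if name not in already_done]


found = ["20121012.txt", "20121014.txt", "20121015.txt"]
processed = ["20121012.txt"]
print(unprocessed(found, processed))  # ['20121014.txt', '20121015.txt']
```

Only the names that survive this filter would go on to the import.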

Make sure to update your already-processed table afterwards with any newly processed files.
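As a rough illustration of that bookkeeping, the sketch below uses Python's built-in `sqlite3` module in place of your real database. The table name `processed_files` and the helper names are assumptions made up for this example. The primary key plus `INSERT OR IGNORE` (SQLite syntax) means that recording the same filename twice does no harm.

```python
import sqlite3


def open_tracking_db(path=":memory:"):
    """Open the database and create the (hypothetical) tracking table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_files (filename TEXT PRIMARY KEY)"
    )
    return conn


def mark_processed(conn, filenames):
    """Record filenames as processed; names already recorded are skipped."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO processed_files (filename) VALUES (?)",
            [(name,) for name in filenames],
        )


def load_processed(conn):
    """Return all recorded filenames, sorted by name."""
    rows = conn.execute("SELECT filename FROM processed_files ORDER BY filename")
    return [row[0] for row in rows]
```

In a Kettle job the equivalent would be a table output at the end of the import that writes each successfully loaded filename to the tracking table.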

You can use the "Get File Names" step. In this step, set the folder(s) that contain your files, and then set the wildcard, which is a regular expression (for example ".*" if you want all files from the folder).

If your database stores the filenames that have already been imported, you can make your transformation idempotent by using "Database Lookup" to check whether each filename is already in the database, and then filtering the stream to pass only the filenames that were not found in the database.
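Here is a small Python sketch of that list, lookup, and filter flow, only to show the logic and not how Kettle implements it. It assumes the wildcard is a regular expression matched against the whole file name, and it assumes a table called `imported_files` with a `filename` column already exists; both names are invented for the example, and `sqlite3` stands in for your database.

```python
import os
import re
import sqlite3


def list_matching_files(folder, wildcard):
    """List file names in folder whose whole name matches the regex wildcard."""
    pattern = re.compile(wildcard)
    return sorted(
        name
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name)) and pattern.fullmatch(name)
    )


def is_imported(conn, filename):
    """Look up one filename in the (hypothetical) imported_files table."""
    row = conn.execute(
        "SELECT 1 FROM imported_files WHERE filename = ?", (filename,)
    ).fetchone()
    return row is not None


def files_to_import(conn, folder, wildcard=r".*\.txt"):
    """Keep only the matching files that the lookup did not find."""
    return [
        name
        for name in list_matching_files(folder, wildcard)
        if not is_imported(conn, name)
    ]
```

Running the transformation twice over the same folder then imports nothing the second time, provided each imported filename is written to the table after its import succeeds.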
