Make a DB INSERT based on Text File Input metadata
I'm developing an ETL process and must add some routines to monitor it.

At the beginning, I must perform an INSERT on the DB to create a record with the filename and the starting process datetime. This query will return the record's PK, which must be stored. When the ETL of that file finishes, I must update that record to indicate that the ETL finished successfully, along with its ending process datetime.

I use Text File Input to look for files that match its regex, and add its "Additional output fields" to the stream. But I can't find a component that will run only for the first record and execute a SQL command for the INSERT.
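The monitoring pattern described above (insert a tracking row, keep its PK, update the same row on completion) can be sketched outside PDI as two plain SQL statements. Here is a minimal Python/sqlite3 sketch; the `etl_log` table and its column names are my own assumptions for illustration, not anything PDI defines:

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE etl_log (          -- hypothetical monitoring table
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        filename  TEXT NOT NULL,
        started   TEXT NOT NULL,
        finished  TEXT,
        status    TEXT NOT NULL DEFAULT 'RUNNING'
    )
""")

def start_run(filename):
    """INSERT the tracking record and return its PK."""
    cur = conn.execute(
        "INSERT INTO etl_log (filename, started) VALUES (?, ?)",
        (filename, datetime.now().isoformat()),
    )
    return cur.lastrowid          # the PK that must be stored for later

def finish_run(pk):
    """UPDATE the same record when the ETL of that file ends."""
    conn.execute(
        "UPDATE etl_log SET finished = ?, status = 'SUCCESS' WHERE id = ?",
        (datetime.now().isoformat(), pk),
    )

pk = start_run("sales_2024.csv")   # hypothetical filename from the stream
# ... run the ETL for that file ...
finish_run(pk)

status = conn.execute(
    "SELECT status FROM etl_log WHERE id = ?", (pk,)
).fetchone()[0]
print(status)  # SUCCESS
```

The question is essentially how to drive those two statements from a PDI stream so the INSERT fires exactly once per file.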
You can use "Identify last row" and "Filter rows" together, so that you keep only one row from your input (filtering for just the last one). Your INSERT will come right after the Filter rows step.

As you will need to split your flow, you'll need to join the ID column back onto the original text input rows.
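Joining the generated ID back onto every input row amounts to a cartesian join of a one-row stream (carrying the PK) with the data stream. A rough Python illustration of that idea, with made-up field names:

```python
# One-row stream carrying the PK returned by the INSERT (hypothetical value).
id_row = {"run_id": 42}

# Original Text File Input rows (hypothetical fields).
data_rows = [
    {"filename": "a.csv", "line": 1},
    {"filename": "a.csv", "line": 2},
]

# A cartesian join against a one-row stream simply tags every
# data row with the run_id, which PDI's join steps can do too.
joined = [{**row, **id_row} for row in data_rows]
print(joined[0]["run_id"])  # 42
```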
You also have a Unique rows step. If you do not specify a field on which to filter for unique values, it will output exactly one row.
Now, unless I misunderstood your specs, I'd rather use Kettle's logging system. Click anywhere on the canvas, select Properties in the popup, then the Logging tab. It will give you the status (Start/End/Stop/...) and plenty of additional information, such as the number of errors and the lines read and written (just tell PDI which step it has to look at for these numbers).
You can even read, almost in real time, the same information in the DB as you see on the bottom panel of PDI. Just pick the fields you want and press the SQL button to create the table.
Just note that, for historical reasons, the start date is not really the start date (it's the date of the previous successful run). The real start date is called the Replay date.
Also, if you need this system to monitor the load and decide whether a run has to start or not, be aware that on an abrupt ending the system sometimes does not have time to write "End" to the log. Therefore a check like logdate < now - 10 minutes is more reliable.
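The logdate < now - 10 minutes heuristic can be expressed as a query against the log table. Below is a sketch using sqlite3; the `trans_log` table layout and status values are simplified stand-ins for Kettle's actual log table, and the "current time" is hard-coded so the example is reproducible:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trans_log (name TEXT, status TEXT, logdate TEXT)")
conn.executemany(
    "INSERT INTO trans_log VALUES (?, ?, ?)",
    [
        # A run that died abruptly: status never reached 'end', but its
        # last log write is old, so it is safe to treat it as finished.
        ("load_sales", "running", "2024-01-01 08:00:00"),
        # A run still alive: its logdate is within the last 10 minutes.
        ("load_stock", "running", "2024-01-01 11:55:00"),
    ],
)

now = "2024-01-01 12:00:00"  # pretend current time
stale = conn.execute(
    "SELECT name FROM trans_log "
    "WHERE status <> 'end' AND logdate < datetime(?, '-10 minutes')",
    (now,),
).fetchall()
print(stale)  # [('load_sales',)]
```

Only the crashed run is flagged; the run that logged within the last 10 minutes is left alone.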
To perform an action only on the first row of the stream, use an "Add sequence" step (starting at 1), followed by a "Filter rows" step with the condition seq = 1.
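The Add sequence / Filter rows combination behaves like numbering the rows and keeping only the first one. In plain Python terms:

```python
rows = [{"filename": "a.csv"}, {"filename": "b.csv"}, {"filename": "c.csv"}]

# "Add sequence": attach a counter starting at 1 to every row.
numbered = [{**row, "seq": i} for i, row in enumerate(rows, start=1)]

# "Filter rows" with condition seq = 1: only the first row passes,
# and that branch is where the INSERT step would be wired.
first_only = [row for row in numbered if row["seq"] == 1]
print(first_only)  # [{'filename': 'a.csv', 'seq': 1}]
```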