NiFi - QueryDatabaseTable processor. How to query rows which are modified?

I am working on a NiFi data flow where my use case is to fetch MySQL table data and put it into HDFS or the local file system.

I have built a data flow pipeline with these processors: QueryDatabaseTable ---> ConvertRecord ---> PutFile.

My table schema ---> id, name, city, Created_date

I receive files at the destination, including when I insert new records into the table.

But when I update existing rows, the processor does not fetch those records. It looks like it has some limitation.

My question is: how do I handle this scenario? Do I need another processor, or do I need to update some property?

Please, someone help. @Bryan Bende

The QueryDatabaseTable processor needs to be told which columns it can use to identify new data.

A serial id or a created timestamp is not sufficient for this, because neither value changes when an existing row is updated.

From the documentation:

Maximum-value Columns:

A comma-separated list of column names. The processor will keep track of the maximum value for each column that has been returned since the processor started running. Using multiple columns implies an order to the column list, and each column's values are expected to increase more slowly than the previous columns' values. Thus, using multiple columns implies a hierarchical structure of columns, which is usually used for partitioning tables. This processor can be used to retrieve only those rows that have been added/updated since the last retrieval. Note that some JDBC types such as bit/boolean are not conducive to maintaining maximum value, so columns of these types should not be listed in this property, and will result in error(s) during processing. If no columns are provided, all rows from the table will be considered, which could have a performance impact. NOTE: It is important to use consistent max-value column names for a given table for incremental fetch to work properly.
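To make the consequence of that description concrete, here is a rough Python simulation of the "remember the maximum value, fetch only rows above it" idea. It is only a sketch, not NiFi's actual implementation: it uses an in-memory SQLite database instead of MySQL, and the table name `person`, the sample rows, and the helper `fetch_incremental` are all made up for illustration. It tracks `id` as the maximum-value column and shows that an inserted row is picked up while an updated row is not.

```python
import sqlite3

def fetch_incremental(conn, max_col, last_max):
    """Fetch rows whose max_col is greater than the remembered maximum.

    Hypothetical helper: mimics the idea of a maximum-value column,
    not NiFi's real code. max_col is a trusted identifier in this sketch.
    """
    sql = "SELECT * FROM person"
    params = ()
    if last_max is not None:
        sql += f" WHERE {max_col} > ?"
        params = (last_max,)
    rows = conn.execute(sql, params).fetchall()
    # Remember the largest value seen so far; keep the old one if nothing came back.
    new_max = max((r[max_col] for r in rows), default=last_max)
    return rows, new_max

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows access by column name
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT, created_date TEXT)"
)
conn.execute("INSERT INTO person VALUES (1, 'Ann', 'Pune', '2020-01-01 10:00:00')")

# First run: no maximum is known yet, so every row is returned.
first, last_max = fetch_incremental(conn, "id", None)

# One new row is inserted and one existing row is updated.
conn.execute("INSERT INTO person VALUES (2, 'Bob', 'Delhi', '2020-01-02 10:00:00')")
conn.execute("UPDATE person SET city = 'Mumbai' WHERE id = 1")

# Second run: only rows with id > 1 qualify, so the update to id 1 is missed.
second, last_max = fetch_incremental(conn, "id", last_max)
second_ids = [r["id"] for r in second]
print(second_ids)
```

The update is missed because neither `id` nor `created_date` changes when the row is modified, so no stored maximum is ever exceeded.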

Judging by the table schema, there is no way in SQL to tell whether a row was updated.

There are many ways to solve this. In your case, the easiest thing to do might be to rename the created column to modified and set it to now() on updates, or to work with a second timestamp column.

So, for instance,

| stamp_updated | timestamp | CURRENT_TIMESTAMP   | on update CURRENT_TIMESTAMP |

is the new column added. In the processor, you use the stamp_updated column to identify new data.

Don't forget to set Maximum-value Columns to those columns.
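As a rough illustration of why an update timestamp works as the maximum-value column, here is a small Python simulation. Treat it as a sketch under assumptions: the table name `person` and the sample data are hypothetical, the MySQL statement is shown only as a string and is not executed, and SQLite stands in for MySQL. SQLite has no `ON UPDATE CURRENT_TIMESTAMP`, so the `UPDATE` below sets `stamp_updated` by hand to imitate what MySQL would do automatically.

```python
import sqlite3

# MySQL statement for adding the column (for reference only, not executed here;
# the table name `person` is an assumption).
MYSQL_DDL = """
ALTER TABLE person
  ADD COLUMN stamp_updated TIMESTAMP NOT NULL
  DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT, stamp_updated TEXT)"
)
conn.execute("INSERT INTO person VALUES (1, 'Ann', 'Pune', '2020-01-01 10:00:00')")
conn.execute("INSERT INTO person VALUES (2, 'Bob', 'Delhi', '2020-01-01 11:00:00')")

def fetch_since(conn, last_stamp):
    """Hypothetical helper: return rows stamped after last_stamp and the new maximum."""
    rows = conn.execute(
        "SELECT id, city, stamp_updated FROM person "
        "WHERE stamp_updated > ? ORDER BY stamp_updated",
        (last_stamp,),
    ).fetchall()
    new_stamp = rows[-1][2] if rows else last_stamp
    return rows, new_stamp

# Initial load: the empty string sorts before every timestamp string.
initial, last_stamp = fetch_since(conn, "")

# Imitate MySQL's ON UPDATE behaviour by bumping the stamp in the UPDATE itself.
conn.execute(
    "UPDATE person SET city = 'Mumbai', stamp_updated = '2020-01-02 09:00:00' WHERE id = 1"
)

# The updated row now exceeds the remembered maximum and is fetched again.
changed, last_stamp = fetch_since(conn, last_stamp)
print(changed)
```

With the stamp bumped on every change, an updated row looks "newer" than the stored maximum, which is exactly the signal the processor needs.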

So what I am basically saying is:

If you cannot tell from SQL yourself that a record is new or changed, NiFi cannot either.
