
Apache NiFi MergeContent output data inconsistent?

I am fairly new to NiFi and need help with the design. I am trying to create a simple flow with dummy CSV files (for now) in an HDFS directory and prepend some text data to each record in each flow file.

Incoming files:

dummy1.csv
dummy2.csv
dummy3.csv

Contents:

"Eldon Base for stackable storage shelf, platinum",Muhammed MacIntyre,3,-213.25,38.94,35,Nunavut,Storage & Organization,0.8
"1.7 Cubic Foot Compact ""Cube"" Office Refrigerators",BarryFrench,293,457.81,208.16,68.02,Nunavut,Appliances,0.58
"Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl",Barry French,293,46.71,8.69,2.99,Nunavut,Binders and Binder Accessories,0.39
...

Desired output:

d17a3259-0718-4c7b-bee8-924266aebcc7,Mon Jun 04 16:36:56 EDT 2018,Fellowes Recycled Storage Drawers,Allen Rosenblatt,11137,395.12,111.03,8.64,Northwest Territories,Storage & Organization,0.78
25f17667-9216-4f1d-b69c-23403cd13464,Mon Jun 04 16:36:56 EDT 2018,Satellite Sectional Post Binders,Barry Weirich,11202,79.59,43.41,2.99,Northwest Territories,Binders and Binder Accessories,0.39
ce0b569f-5d93-4a54-b55e-09c18705f973,Mon Jun 04 16:36:56 EDT 2018,Deflect-o DuraMat Antistatic Studded Beveled Mat for Medium Pile Carpeting,Doug Bickford,11456,399.37,105.34,24.49,Northwest Territories,Office Furnishings,0.61

The flow (screenshots of the flow and the processor configurations omitted): SplitText -> ReplaceText -> MergeContent

(This may be a poor way to achieve what I am trying to get, but I read somewhere that a UUID is the best bet for generating a unique session ID, so I thought of splitting each line of the incoming data into its own flow file and generating a UUID for it.)
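
For what it's worth, a minimal ReplaceText configuration for this prepend step might look like the following (assuming each flow file coming out of SplitText holds exactly one CSV line; the date format pattern is only an assumption to match the sample output):

  Replacement Strategy : Prepend
  Evaluation Mode      : Entire text
  Replacement Value    : ${UUID()},${now():format('EEE MMM dd HH:mm:ss zzz yyyy')},

Here ${UUID()} and ${now()} are NiFi Expression Language functions, so each single-line flow file gets its own UUID and timestamp prepended.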

But somehow, as you can see, the order of the data gets messed up: the first 3 rows of the output are not the same as the input. The same data from my test file (50,000 entries) does appear, just on other lines; multiple tests show that the order usually starts changing after the 2001st line.

And yes, I did search for similar issues here and tried using the Defragment merge strategy, but it didn't work. I would appreciate it if someone could explain what is happening here and how I can keep the data in the original order while adding a unique session_id and timestamp to each record. Is there some parameter I need to change or modify to get the correct output? I am also open to suggestions if there is a better way.

First of all, thank you for such an elaborate and detailed response. I think you cleared up a lot of the doubts I had about how the processor works!

The ordering of the merge is only guaranteed in Defragment mode, because it puts the flow files in order according to their fragment index. I'm not sure why that wouldn't be working, but if you could create a template of a flow with sample data that shows the problem, it would be helpful for debugging.

I will try to replicate this again with a clean template. It could be some parameter problem that keeps the HDFS writer from being able to write.

I'm not sure if the intent of your flow is just to re-merge the original CSV that was split, or to merge together several different CSVs. Defragment mode will only re-merge the original CSV, so if ListHDFS picked up 10 CSVs, then after splitting and re-merging you should again have 10 CSVs.
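
For reference, the re-assembly in Defragment mode relies on the fragment attributes that SplitText writes on every split, roughly as below (attribute names as in recent NiFi releases; the bin count is only a suggested starting point, not something from this thread):

  Attributes written by SplitText on each split flow file:
    fragment.identifier = same value for every split of one source CSV
    fragment.index      = position of the split within that source CSV
    fragment.count      = total number of splits produced from that CSV

  MergeContent settings to put the splits back together in order:
    Merge Strategy          = Defragment
    Maximum number of Bins  = at least the number of source CSVs in flight

If more distinct source files are in flight than there are bins, a bin can be evicted before it is complete, so the bin count is worth checking when Defragment mode does not behave as expected.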

Yes, that is exactly what I need: split the data and join it back to its corresponding file. I don't specifically need to join the outputs together again (yet).

The approach of splitting a CSV down to one line per flow file in order to manipulate each line is a common one, however it won't perform very well if you have many large CSV files. A more efficient approach is to manipulate the data in place, without splitting. This can generally be done with the record-oriented processors.
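
As an illustration of what "in place" could look like (processor names only; this is a sketch rather than the exact flow discussed here), a record-oriented version of the pipeline might be:

  ListHDFS -> FetchHDFS -> UpdateRecord (CSVReader as Record Reader, CSVRecordSetWriter as Record Writer) -> PutHDFS

No SplitText or MergeContent is needed, because UpdateRecord rewrites every record inside each flow file without ever breaking the file apart.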

  1. I used this approach purely instinctively and did not realize it is a common method. Sometimes the data file can be very large, meaning more than a million records in a single file. Won't that be an I/O issue in the cluster, since it would mean each record = one flow file = one unique UUID? What is a comfortable number of flow files for NiFi to handle? (I know it depends on the cluster configuration and will try to get more info about the cluster from the HDP admin.)
  2. What do you mean by "try and manipulate the data in place without splitting"? Can you give an example, template, or processor to use?

In this case you would need to define a schema for your CSV which includes all the columns in your data, plus the session id and timestamp. Then, using an UpdateRecord processor, you would use record path expressions like /session_id = ${UUID()} and /timestamp = ${now()}. This streams the content line by line, updates each record, and writes it back out, keeping it all as one flow file.
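
A rough sketch of that setup, with all field names invented here purely for illustration (the exact reader/writer wiring, such as whether the reader uses a schema without the two new fields, would still need to be worked out):

  Avro schema supplied to the record writer:
  {
    "type": "record",
    "name": "sales_record",
    "fields": [
      { "name": "session_id",   "type": "string" },
      { "name": "timestamp",    "type": "string" },
      { "name": "product_name", "type": "string" },
      { "name": "customer",     "type": "string" },
      { "name": "order_id",     "type": "int" },
      { "name": "sales",        "type": "double" },
      { "name": "profit",       "type": "double" },
      { "name": "unit_price",   "type": "double" },
      { "name": "region",       "type": "string" },
      { "name": "category",     "type": "string" },
      { "name": "discount",     "type": "double" }
    ]
  }

  UpdateRecord dynamic properties (Replacement Value Strategy left at "Literal Value"):
    /session_id = ${UUID()}
    /timestamp  = ${now()}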

This looks promising. Can you share a simple template that pulls files from HDFS, processes them, and writes them back to HDFS, but without splitting?

I am reluctant to share the template due to restrictions, but let me see if I can create a generic template, and I will share that.

Thank you for your wisdom! :)
