Apache NiFi: Processing multiple csv's using the ExecuteScript Processor
I have a csv with 70 columns. The 60th column contains a value which decides whether the record is valid or invalid. If the 60th column has 0, 1, 6 or 7, the record is valid. If it contains any other value, it is invalid.
I realised that this functionality wasn't possible relying completely on changing properties of processors in Apache NiFi. Therefore I decided to use the ExecuteScript processor and added this python code as the text body.
import csv

valid = 0
invalid = 0
total = 0

with open('/Users/himsaragallage/Desktop/redder/Regexo_2019101812750.dat.csv') as f, \
        open("valid.csv", "w", newline="") as file1, \
        open("invalid.csv", "w", newline="") as file2:
    valid_writer = csv.writer(file1)
    invalid_writer = csv.writer(file2)
    for row in csv.reader(f):  # iterate parsed rows, not raw lines of the file
        total += 1
        if row[59] in ("0", "1", "6", "7"):  # 60th column decides validity
            valid += 1
            valid_writer.writerow(row)
        else:
            invalid += 1
            invalid_writer.writerow(row)

print("Total : " + str(total))
print("Valid : " + str(valid))
print("Invalid : " + str(invalid))
I have no idea how to use a session and code within the ExecuteScript processor as shown in this question. So I just wrote a simple python script and directed the valid and invalid data to different files. This approach I have used has many limitations.
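For reference, a session-based ExecuteScript body could look roughly like the sketch below (Script Engine set to python, i.e. Jython). This is an untested illustration, not NiFi's documented recipe for this exact task: the variables `session`, `REL_SUCCESS` and `REL_FAILURE` are injected by the processor, the naive `split(',')` does not handle quoted commas, and since ExecuteScript only exposes success and failure relationships, valid rows are sent to success and invalid rows to failure here.

```python
# Sketch for NiFi ExecuteScript (Script Engine: python / Jython).
# `session`, REL_SUCCESS and REL_FAILURE are injected by the processor.
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import InputStreamCallback, OutputStreamCallback

VALID_CODES = ("0", "1", "6", "7")

class ReadAll(InputStreamCallback):
    def __init__(self):
        self.text = None
    def process(self, inputStream):
        self.text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)

class WriteLines(OutputStreamCallback):
    def __init__(self, lines):
        self.lines = lines
    def process(self, outputStream):
        outputStream.write('\n'.join(self.lines).encode('utf-8'))

flowFile = session.get()
if flowFile is not None:
    reader = ReadAll()
    session.read(flowFile, reader)
    lines = reader.text.splitlines()
    # naive comma split: does not handle quoted fields
    valid = [l for l in lines if l.split(',')[59] in VALID_CODES]
    invalid = [l for l in lines if l.split(',')[59] not in VALID_CODES]

    validFF = session.write(session.create(flowFile), WriteLines(valid))
    invalidFF = session.write(session.create(flowFile), WriteLines(invalid))
    session.transfer(validFF, REL_SUCCESS)    # valid rows -> success
    session.transfer(invalidFF, REL_FAILURE)  # invalid rows -> failure
    session.remove(flowFile)
```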
There will be around 20 csvs in the redder folder. All of them must be processed in one go. Hope you could suggest a method for me to do the following. Feel free to provide me with a solution by editing the python code I have used, or even by using a completely different set of processors and totally excluding the use of the ExecuteScript processor.
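Until a pure-NiFi flow is in place, the script itself can at least be extended to handle a whole folder in one run. Here is a minimal standalone sketch (the directory path and the 0-indexed flag column are taken from the question; the valid.csv/invalid.csv naming is kept):

```python
import csv
import glob
import os

VALID_CODES = {"0", "1", "6", "7"}

def split_csvs(input_dir, valid_path, invalid_path, flag_col=59):
    """Route every row of every .csv in input_dir to the valid or invalid
    output file, based on the value in the flag column (0-indexed)."""
    total = valid = invalid = 0
    outputs = {os.path.abspath(valid_path), os.path.abspath(invalid_path)}
    with open(valid_path, "w", newline="") as vf, \
            open(invalid_path, "w", newline="") as inf:
        valid_writer = csv.writer(vf)
        invalid_writer = csv.writer(inf)
        for path in sorted(glob.glob(os.path.join(input_dir, "*.csv"))):
            if os.path.abspath(path) in outputs:
                continue  # never re-read our own output files
            with open(path, newline="") as f:
                for row in csv.reader(f):
                    total += 1
                    if row[flag_col] in VALID_CODES:
                        valid += 1
                        valid_writer.writerow(row)
                    else:
                        invalid += 1
                        invalid_writer.writerow(row)
    return total, valid, invalid
```

Called as e.g. `split_csvs("/Users/himsaragallage/Desktop/redder", "valid.csv", "invalid.csv")`, it processes all ~20 csvs in one go and returns the total/valid/invalid counts.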
Here are complete step-by-step instructions on how to use the QueryRecord processor.
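As a sketch of those properties, assuming the CSVReader and CSVRecordSetWriter services are configured and the 60th column is named `status` in your schema (`status` is a placeholder; use your real header name): each user-defined property on QueryRecord becomes an output relationship carrying the rows matched by its SQL.

```sql
-- user-defined property "valid":
SELECT * FROM FLOWFILE WHERE status IN ('0', '1', '6', '7')

-- user-defined property "invalid":
SELECT * FROM FLOWFILE WHERE status NOT IN ('0', '1', '6', '7')
```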
Basically, you need to set up the highlighted properties.
You want to route records based on values from one column. There are various ways to make this happen in NiFi.
I will show you how to solve your problem using the PartitionRecord processor. Since you did not provide any example data, I created an example use case: I want to distinguish cities in Europe from cities elsewhere. The following data is given:
id,city,country
1,Berlin,Germany
2,Paris,France
3,New York,USA
4,Frankfurt,Germany
Flow:
GenerateFlowFile:
PartitionRecord:
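As a sketch of the PartitionRecord configuration: besides the reader and writer services, you add one user-defined property whose value is a RecordPath pointing at the field to group by (the property name below is illustrative; it becomes the attribute name on the outgoing flowfiles):

```
country = /country
```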
CSVReader should be set up to infer schema and CSVRecordSetWriter to inherit schema. PartitionRecord will group records by country and pass them on together with an attribute country that has the country value. You will see the following groups of records:
id,city,country
1,Berlin,Germany
4,Frankfurt,Germany
id,city,country
2,Paris,France
id,city,country
3,New York,USA
Each group is a flowfile and will have the country attribute, which you will use to route the groups.
RouteOnAttribute:
All countries from Europe will be routed to the is_europe relationship. Now you can apply the same strategy to your use case.
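As a sketch, the is_europe rule can be a user-defined property on RouteOnAttribute whose value is a NiFi Expression Language check against the country attribute; extend the or-chain with every European country you expect:

```
is_europe = ${country:equals('Germany'):or(${country:equals('France')})}
```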