
Apache NiFi: Processing multiple CSVs using the ExecuteScript processor

I have a CSV with 70 columns. The 60th column contains a value that decides whether the record is valid or invalid. If the 60th column contains 0, 1, 6 or 7, the record is valid. If it contains any other value, it is invalid.

I realised that this functionality isn't possible by only changing processor properties in Apache NiFi. Therefore I decided to use the ExecuteScript processor and added this Python code as the script body.

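import csv

VALID_CODES = {"0", "1", "6", "7"}  # values of the 60th column that mark a valid record

valid = 0
invalid = 0
total = 0

# newline="" lets the csv module handle line endings itself
with open('/Users/himsaragallage/Desktop/redder/Regexo_2019101812750.dat.csv', newline="") as f, \
        open("valid.csv", "w", newline="") as file1, \
        open("invalid.csv", "w", newline="") as file2:
    valid_writer = csv.writer(file1)
    invalid_writer = csv.writer(file2)
    # Iterate over the parsed rows from csv.reader, not over the raw lines of f
    for row in csv.reader(f):
        total += 1
        if row[59] in VALID_CODES:  # the 60th column is index 59
            valid += 1
            valid_writer.writerow(row)
        else:
            invalid += 1
            invalid_writer.writerow(row)

print("Total : " + str(total))
print("Valid : " + str(valid))
print("Invalid : " + str(invalid))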

I have no idea how to use a session within the ExecuteScript processor, as shown in this question. So I just wrote simple Python code and directed the valid and invalid data to different files. This approach has many limitations.

  1. I want to be able to dynamically process CSVs with different filenames.
  2. The CSV that the invalid data is sent to must have the same filename as the input CSV.
  3. There will be around 20 CSVs in my redder folder. All of them must be processed in one go.

I hope you can suggest a way to do this. Feel free to provide a solution by editing the Python code I have used, or by using a completely different set of processors without the ExecuteScript processor at all.
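As a rough sketch of what the three requirements above amount to outside NiFi, the plain-Python script below loops over every `.csv` file in an input folder and writes valid and invalid rows to files that keep the input's filename. The function name `split_folder`, the `valid`/`invalid` output subfolders, and the treatment of short rows as invalid are assumptions made for illustration. It does not use the ExecuteScript session API.

```python
import csv
from pathlib import Path

VALID_CODES = {"0", "1", "6", "7"}  # allowed values of the 60th column
STATUS_INDEX = 59                   # 60th column, zero-based


def split_folder(in_dir, out_dir, status_index=STATUS_INDEX):
    """Split every CSV in in_dir into out_dir/valid/<name> and out_dir/invalid/<name>.

    Hypothetical helper: returns {filename: (valid_count, invalid_count)}.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    valid_dir, invalid_dir = out_dir / "valid", out_dir / "invalid"
    valid_dir.mkdir(parents=True, exist_ok=True)
    invalid_dir.mkdir(parents=True, exist_ok=True)

    counts = {}
    for src in sorted(in_dir.glob("*.csv")):
        n_valid = n_invalid = 0
        # Both output files reuse the input filename.
        with open(src, newline="") as f, \
                open(valid_dir / src.name, "w", newline="") as fv, \
                open(invalid_dir / src.name, "w", newline="") as fi:
            valid_writer, invalid_writer = csv.writer(fv), csv.writer(fi)
            for row in csv.reader(f):
                # Assumption: rows too short to have the status column count as invalid.
                if len(row) > status_index and row[status_index] in VALID_CODES:
                    valid_writer.writerow(row)
                    n_valid += 1
                else:
                    invalid_writer.writerow(row)
                    n_invalid += 1
        counts[src.name] = (n_valid, n_invalid)
    return counts
```

A call such as `split_folder("/path/to/redder", "/path/to/output")` would then handle all files in the folder in one run; both paths here are placeholders.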

Here are complete step-by-step instructions on how to use the QueryRecord processor.

Basically, you need to set up the highlighted properties:

[screenshot: QueryRecord processor properties]
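The screenshot is not preserved, so the exact property values are unknown. In general, QueryRecord runs SQL against the records of each FlowFile, exposed as a table named FLOWFILE, and every user-added property becomes a relationship named after that property. The sketch below only imitates that idea in Python with the standard-library `sqlite3` module (NiFi itself does not use SQLite). The column name `status` and the relationship names `valid` and `invalid` are hypothetical stand-ins for the 60th column and for whatever the original configuration used.

```python
import csv
import io
import sqlite3

# Hypothetical queries, one per output relationship, in the style of QueryRecord properties.
QUERIES = {
    "valid": "SELECT * FROM FLOWFILE WHERE status IN ('0', '1', '6', '7')",
    "invalid": "SELECT * FROM FLOWFILE WHERE status NOT IN ('0', '1', '6', '7')",
}


def query_records(csv_text):
    """Load CSV text (with a header row) into an in-memory FLOWFILE table and run each query."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    con = sqlite3.connect(":memory:")
    # Every column is stored as text, so the queries compare against string literals.
    columns = ", ".join('"{}" TEXT'.format(name) for name in header)
    con.execute("CREATE TABLE FLOWFILE ({})".format(columns))
    placeholders = ", ".join("?" for _ in header)
    con.executemany("INSERT INTO FLOWFILE VALUES ({})".format(placeholders), data)
    result = {name: con.execute(sql).fetchall() for name, sql in QUERIES.items()}
    con.close()
    return result


sample = "id,status\n1,0\n2,3\n3,7\n"
routed = query_records(sample)
print(routed["valid"])    # rows whose status is 0, 1, 6 or 7
print(routed["invalid"])  # all other rows
```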

You want to route records based on the values in one column. There are various ways to make this happen in NiFi. I can think of the following:

I will show you how to solve your problem using the PartitionRecord processor. Since you did not provide any example data, I created an example use case. I want to distinguish cities in Europe from cities elsewhere. The following data is given:

id,city,country
1,Berlin,Germany
2,Paris,France
3,New York,USA
4,Frankfurt,Germany

Flow:

[screenshot: flow overview]

GenerateFlowFile:

[screenshot: GenerateFlowFile configuration]

PartitionRecord:

[screenshot: PartitionRecord configuration]

CSVReader should be set up to infer the schema and CSVRecordSetWriter to inherit the schema. PartitionRecord will group records by country and pass them on together with an attribute country that holds the country value. You will see the following groups of records:

id,city,country
1,Berlin,Germany
4,Frankfurt,Germany

id,city,country
2,Paris,France

id,city,country
3,New York,USA

Each group is a flowfile and has the country attribute, which you will use to route the groups.
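To make the grouping concrete, here is a small plain-Python imitation of what PartitionRecord does with the sample data: records that share the same `country` value end up in one group, and that value travels with the group as an attribute. The function `partition_by` and the dictionary layout are made up for this illustration and are not part of NiFi.

```python
import csv
import io

SAMPLE = """id,city,country
1,Berlin,Germany
2,Paris,France
3,New York,USA
4,Frankfurt,Germany
"""


def partition_by(csv_text, field):
    """Group CSV records by the value of `field`, keeping first-seen group order.

    Each group stands in for one flowfile: its records plus an attribute
    named after the field.
    """
    groups = {}
    for record in csv.DictReader(io.StringIO(csv_text)):
        key = record[field]
        group = groups.setdefault(key, {"attributes": {field: key}, "records": []})
        group["records"].append(record)
    return list(groups.values())


for flowfile in partition_by(SAMPLE, "country"):
    cities = [r["city"] for r in flowfile["records"]]
    print(flowfile["attributes"]["country"], cities)
```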

RouteOnAttribute:

[screenshot: RouteOnAttribute configuration]

All countries from Europe will be routed to the is_europe relationship. Now you can apply the same strategy to your use case.
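Applied to the original question, the same two steps would be: partition on the 60th column, then route each group on whether its attribute value is 0, 1, 6 or 7. The Python below only imitates that routing decision and is not NiFi configuration. The attribute name `status` and the relationship names `valid` and `invalid` are hypothetical, and the actual RouteOnAttribute expression from the screenshot is not reproduced here.

```python
VALID_CODES = {"0", "1", "6", "7"}  # values of the 60th column that mark a valid record


def route_on_attribute(attributes, name="status"):
    """Return the relationship a group would be sent to, based on one attribute.

    `name` is a hypothetical attribute holding the value of the 60th column,
    as set by the partitioning step.
    """
    return "valid" if attributes.get(name) in VALID_CODES else "invalid"


# One entry per partitioned group; only the attributes matter for routing.
groups = [{"status": "0"}, {"status": "3"}, {"status": "7"}, {"status": ""}]
for attrs in groups:
    print(attrs["status"], "->", route_on_attribute(attrs))
```

Because every record in a partitioned group shares the same value, a single check per group is enough to send the whole flowfile one way or the other.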
