
Apache NiFi: Processing multiple CSVs using the ExecuteScript Processor

I have a CSV with 70 columns. The 60th column contains a value which decides whether the record is valid or invalid. If the 60th column has 0, 1, 6 or 7, the record is valid; if it contains any other value, it is invalid.

I realised that this functionality wasn't possible relying entirely on changing processor properties in Apache NiFi. Therefore I decided to use the ExecuteScript processor and added this Python code as the script body.

import csv

valid = 0
invalid = 0
total = 0

# open the output files and wrap them in csv writers, since csv.reader
# yields lists of fields rather than raw lines
file2 = open("invalid.csv", "w")
file1 = open("valid.csv", "w")
invalid_writer = csv.writer(file2)
valid_writer = csv.writer(file1)

with open('/Users/himsaragallage/Desktop/redder/Regexo_2019101812750.dat.csv') as f:
    r = csv.reader(f)
    for row in r:  # iterate over the csv.reader, not the raw file object
        total += 1

        # the 60th column (index 59) decides whether the record is valid
        if row[59] in ("0", "1", "6", "7"):
            valid += 1
            valid_writer.writerow(row)
        else:
            invalid += 1
            invalid_writer.writerow(row)

file1.close()
file2.close()
print("Total : " + str(total))
print("Valid : " + str(valid))
print("Invalid : " + str(invalid))

I have no idea how to use a session and write code within the ExecuteScript processor as shown in this question. So I just wrote a simple Python script and directed the valid and invalid data to different files. This approach has many limitations:

  1. I want to be able to dynamically process CSVs with different filenames.
  2. The CSV that the invalid data is sent to must have the same filename as the input CSV.
  3. There are around 20 CSVs in my redder folder. All of them must be processed in one go.

Hope you can suggest a method for me to do the above. Feel free to provide a solution by editing the Python code I have used, or by using a different set of processors entirely and excluding the ExecuteScript processor altogether.
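For reference, below is a minimal, untested sketch of how the session API could be used inside ExecuteScript (with Jython) for this kind of split. It reads the incoming flowfile, writes the valid rows back to it, clones it for the invalid rows (so the invalid output keeps the original filename attribute), and marks each output with a validity attribute. The attribute name and the clone-based approach are assumptions, not something given in the question.

import csv
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

# StreamCallback that replaces the flowfile content with the given rows
class WriteRows(StreamCallback):
    def __init__(self, rows):
        self.rows = rows
    def process(self, inputStream, outputStream):
        outputStream.write(bytearray(('\n'.join(self.rows) + '\n').encode('utf-8')))

flowFile = session.get()
if flowFile is not None:
    # read the whole incoming CSV (no special header handling in this sketch)
    inputStream = session.read(flowFile)
    text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
    inputStream.close()

    valid_rows, invalid_rows = [], []
    for row in csv.reader(text.splitlines()):
        if len(row) > 59 and row[59] in ("0", "1", "6", "7"):
            valid_rows.append(','.join(row))
        else:
            invalid_rows.append(','.join(row))

    # clone before modifying, so the invalid flowfile inherits the original
    # attributes (including filename)
    invalidFF = session.clone(flowFile)

    validFF = session.write(flowFile, WriteRows(valid_rows))
    validFF = session.putAttribute(validFF, 'validity', 'valid')

    invalidFF = session.write(invalidFF, WriteRows(invalid_rows))
    invalidFF = session.putAttribute(invalidFF, 'validity', 'invalid')

    session.transfer(validFF, REL_SUCCESS)
    session.transfer(invalidFF, REL_SUCCESS)

A ListFile/FetchFile pair pointed at the redder folder could feed all of the CSVs through this processor one by one, and a downstream RouteOnAttribute on validity plus PutFile (which uses the filename attribute by default) could then write each output under its original name.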

Here are complete step-by-step instructions on how to use the QueryRecord processor.

Basically, you need to set up the highlighted properties:

[image: QueryRecord processor configuration]
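The screenshot is not reproduced here, but the essential QueryRecord configuration is a record reader/writer pair plus one dynamic property per desired output relationship, each holding a SQL query over the incoming records. Applied to the question it could look roughly like this (status_code is a placeholder column name; the real name, and whether the values need quoting, depend on the schema your CSVReader uses for the 60th column):

Record Reader       : CSVReader
Record Writer       : CSVRecordSetWriter
valid   (dynamic)   : SELECT * FROM FLOWFILE WHERE status_code IN ('0','1','6','7')
invalid (dynamic)   : SELECT * FROM FLOWFILE WHERE status_code NOT IN ('0','1','6','7')

Each dynamic property becomes a relationship with the same name, so valid and invalid records leave the processor on separate connections.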

You want to route records based on the values in one column. There are various ways to make this happen in NiFi.

I will show you how to solve your problem using the PartitionRecord processor. Since you did not provide any example data, I created an example use case: I want to distinguish cities in Europe from cities elsewhere. The following data is given:

id,city,country
1,Berlin,Germany
2,Paris,France
3,New York,USA
4,Frankfurt,Germany

Flow:

[image: flow overview]

GenerateFlowFile:

[image: GenerateFlowFile configuration]
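(GenerateFlowFile is only used here as a test source; presumably the screenshot shows the sample CSV pasted into its Custom Text property, roughly:)

Custom Text  : the four-line id,city,country sample shown above
Run Schedule : something long enough, e.g. 60 sec, so the test data is not generated continuously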

PartitionRecord:

[image: PartitionRecord configuration]
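The key PartitionRecord settings behind that screenshot are roughly as follows (a sketch, not the exact values from the image):

Record Reader              : CSVReader (Schema Access Strategy: Infer Schema)
Record Writer              : CSVRecordSetWriter (Schema Access Strategy: Inherit Record Schema)
country (dynamic property) : /country   <- a RecordPath; its name also becomes the attribute set on each partition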

CSVReader should be set up to infer the schema and CSVRecordSetWriter to inherit the schema. PartitionRecord will group records by country and pass each group on together with a country attribute containing the country value. You will see the following groups of records:

id,city,country
1,Berlin,Germany
4,Frankfurt,Germany

id,city,country
2,Paris,France

id,city,country
3,New York,USA

Each group is a flowfile and will have the country attribute, which you will use to route the groups.

RouteOnAttribute:

[image: RouteOnAttribute configuration]
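The rule in that screenshot is a dynamic property whose name becomes a relationship; an equivalent expression, covering just the two European countries in the sample, would be something like:

Routing Strategy : Route to Property name
is_europe        : ${country:equals('Germany'):or(${country:equals('France')})}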

All records with European countries will be routed to the is_europe relationship. You can now apply the same strategy to your use case.
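Translated back to the original question, the same pattern might be sketched like this (status is a hypothetical field name; it depends on the schema the CSVReader infers, or that you define, for the 60th column):

PartitionRecord  : status -> /status
RouteOnAttribute : valid  -> ${status:equals('0'):or(${status:equals('1')}):or(${status:equals('6')}):or(${status:equals('7')})}

Partitions that do not match go to RouteOnAttribute's unmatched relationship, and since the filename attribute travels with each partition, a downstream PutFile should keep the original file name for both the valid and the invalid output.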
