I am setting up a simple NiFi flow that reads from an RDBMS source and writes to a Hive sink. The flow works as expected until the PutHiveQL processor, which runs extremely slowly: it inserts approximately one record per minute.
NiFi is currently set up as a standalone instance running on one node.
The logs show an insert approximately every minute:
INSERT INTO customer (id, name, address) VALUES (x, x, x)
Any ideas about why this may be? Improvements to try?
Thanks in advance
Inserting one record at a time into Hive will result in extreme slowness.
Since you are doing regular inserts into a Hive table, change your flow to:
QueryDatabaseTable
PutHDFS
Then create a Hive Avro table on top of the HDFS directory where you stored the data.
(or)
QueryDatabaseTable
ConvertAvroToORC // in case you need to store the data in ORC format
PutHDFS
Then create a Hive ORC table on top of the HDFS directory where you stored the data.
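As a rough sketch of that last step, the table definition could look something like the following. The column types and the HDFS paths are assumptions for illustration only (the question shows just the column names id, name and address); they need to match the schema that QueryDatabaseTable actually produces and the directory configured in PutHDFS.

```sql
-- Avro variant: external table over the directory that PutHDFS writes to.
-- '/data/nifi/customer_avro' is a hypothetical path; column types are assumed.
CREATE EXTERNAL TABLE customer_avro (
  id      INT,
  name    STRING,
  address STRING
)
STORED AS AVRO
LOCATION '/data/nifi/customer_avro';

-- ORC variant: same idea, for files produced by ConvertAvroToORC.
-- '/data/nifi/customer_orc' is likewise a hypothetical path.
CREATE EXTERNAL TABLE customer_orc (
  id      INT,
  name    STRING,
  address STRING
)
STORED AS ORC
LOCATION '/data/nifi/customer_orc';
```

Because the tables are external, Hive reads whatever files are in the directory at query time, so new files written by PutHDFS become visible without any INSERT statements.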
Are you pushing one record at a time? If so, you may use the MergeRecord processor to create batches before pushing into PutHiveQL.
Batching into 100 records is recommended. See here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hive-nar/1.5.0/org.apache.nifi.processors.hive.PutHiveQL/
Batch Size | 100 | The preferred number of FlowFiles to put to the database in a single transaction
Use the MergeRecord processor and set the number of records and/or a timeout; it should speed things up considerably.