Transform CSV into Parquet using Apache Flume?

I have a question: is it possible to perform ETL on data using Flume? To be more specific, I have Flume configured with a spoolDir source watching a directory of CSV files, and I want to convert those files into Parquet before storing them in Hadoop. Is that possible?

If it's not possible, would you recommend transforming the files before they reach Hadoop, or transforming them with Spark once they are in Hadoop?
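For reference, here is what the Spark route would look like; a minimal PySpark sketch, assuming the CSV files have a header row (the HDFS paths are placeholders):

    # Minimal sketch: read the CSV files and rewrite them as Parquet.
    # Paths and read options are assumptions, not taken from the question.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    df = (spark.read
          .option("header", "true")        # assume the first line holds column names
          .option("inferSchema", "true")   # let Spark guess the column types
          .csv("hdfs:///landing/csv/"))    # hypothetical input directory

    # Write the same rows back out as Parquet files.
    df.write.mode("overwrite").parquet("hdfs:///warehouse/parquet/")

    spark.stop()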

I'd probably suggest using NiFi to move the files around. Here's a specific tutorial on how to do that with Parquet. I feel NiFi is the replacement for Apache Flume.

Partial answers with Flume (not Parquet): if you are flexible on format, you can use an Avro sink. You can also use a Hive sink, which will create a table in ORC format. (You can check whether the table definition also allows Parquet, but from what I have heard ORC is the only supported format.)
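As a rough illustration of the Hive-sink option, a Flume agent configuration could look like the sketch below. The metastore URI, database, table, and field names are placeholders, and the target table has to already exist as a bucketed, transactional table stored as ORC:

    # Sketch of a spoolDir -> Hive (ORC) agent; all names and URIs are placeholders.
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Spooling-directory source watching the CSV drop folder
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /data/csv-inbox
    a1.sources.r1.channels = c1

    a1.channels.c1.type = memory

    # Hive sink: the target table must already exist as a bucketed,
    # transactional Hive table STORED AS ORC
    a1.sinks.k1.type = hive
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hive.metastore = thrift://metastore-host:9083
    a1.sinks.k1.hive.database = staging
    a1.sinks.k1.hive.table = csv_events
    a1.sinks.k1.serializer = DELIMITED
    a1.sinks.k1.serializer.delimiter = ","
    a1.sinks.k1.serializer.fieldnames = id,name,value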

You could then likely use a simple script in Hive to move the data from the ORC table into a Parquet table (producing the Parquet files you asked for).
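Since the question already mentions Spark on Hadoop, that copy step can also be done against the Hive metastore from PySpark; a minimal sketch with placeholder table names (the equivalent HiveQL would be a CREATE TABLE ... STORED AS PARQUET AS SELECT):

    # Sketch: copy the Hive ORC table into a new Parquet-backed table.
    # Table names are placeholders; enableHiveSupport() needs a reachable metastore.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("orc-to-parquet")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.table("staging.csv_events")          # the ORC table the Hive sink filled
    (df.write
       .format("parquet")
       .mode("overwrite")
       .saveAsTable("staging.csv_events_parquet"))  # new table stored as Parquet

    spark.stop()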
