Transform CSV into Parquet using Apache Flume?

I have a question: is it possible to perform ETL on data using Flume? To be more specific, I have Flume configured with a spoolDir source watching a directory of CSV files, and I want to convert those files into Parquet before storing them in Hadoop. Is that possible?

If it's not possible, would you recommend transforming the files before they reach Hadoop, or transforming them with Spark once they are in Hadoop?
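For reference, here is what the Spark route would look like; a minimal PySpark sketch, assuming the CSV files have a header row (the HDFS paths are placeholders):

    # Minimal sketch: read the CSV files and rewrite them as Parquet.
    # Paths and read options are assumptions, not taken from the question.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    df = (spark.read
          .option("header", "true")        # assume the first line holds column names
          .option("inferSchema", "true")   # let Spark guess the column types
          .csv("hdfs:///landing/csv/"))    # hypothetical input directory

    # Write the same rows back out as Parquet files.
    df.write.mode("overwrite").parquet("hdfs:///warehouse/parquet/")

    spark.stop()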

I'd probably suggest using NiFi to move the files around. Here's a specific tutorial on how to do that with Parquet. I feel NiFi is the replacement for Apache Flume.

Partial answers with Flume (not Parquet): if you are flexible on format, you can use an Avro sink. You can also use a Hive sink, which will create a table in ORC format. (You can check whether the table definition also allows Parquet, but from what I have heard ORC is the only supported format.)
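As a rough illustration of the Hive-sink option, a Flume agent configuration could look like the sketch below. The metastore URI, database, table, and field names are placeholders, and the target table has to already exist as a bucketed, transactional table stored as ORC:

    # Sketch of a spoolDir -> Hive (ORC) agent; all names and URIs are placeholders.
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Spooling-directory source watching the CSV drop folder
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /data/csv-inbox
    a1.sources.r1.channels = c1

    a1.channels.c1.type = memory

    # Hive sink: the target table must already exist as a bucketed,
    # transactional Hive table STORED AS ORC
    a1.sinks.k1.type = hive
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hive.metastore = thrift://metastore-host:9083
    a1.sinks.k1.hive.database = staging
    a1.sinks.k1.hive.table = csv_events
    a1.sinks.k1.serializer = DELIMITED
    a1.sinks.k1.serializer.delimiter = ","
    a1.sinks.k1.serializer.fieldnames = id,name,value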

You could then likely use a simple script in Hive to move the data from the ORC table into a Parquet table (producing the Parquet files you asked for).
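Since the question already mentions Spark on Hadoop, that copy step can also be done against the Hive metastore from PySpark; a minimal sketch with placeholder table names (the equivalent HiveQL would be a CREATE TABLE ... STORED AS PARQUET AS SELECT):

    # Sketch: copy the Hive ORC table into a new Parquet-backed table.
    # Table names are placeholders; enableHiveSupport() needs a reachable metastore.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("orc-to-parquet")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.table("staging.csv_events")          # the ORC table the Hive sink filled
    (df.write
       .format("parquet")
       .mode("overwrite")
       .saveAsTable("staging.csv_events_parquet"))  # new table stored as Parquet

    spark.stop()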
