Save dataframe as AVRO Spark 2.4.0

Since Spark 2.4.0 it's possible to save as AVRO without external jars. However, I can't get it working at all. My code looks like this:

key = 'filename.avro'
df.write.mode('overwrite').format("avro").save(key)

I get the following error:

pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'

So I look at the Apache Avro Data Source Guide ( https://spark.apache.org/docs/latest/sql-data-sources-avro.html ) and it gives the following example:

df = spark.read.format("avro").load("examples/src/main/resources/users.avro")

df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")

It is the same, so I'm lost. Does anyone have an idea what is going wrong?

The documentation you've linked clearly says that:

The spark-avro module is external and not included in spark-submit or spark-shell by default.

and further explains how to include the package.

So your statement:

Since Spark 2.4.0 it's possible to save as AVRO without external jars.

is just incorrect.

The spark-avro module is external and not included in spark-submit or spark-shell by default.

As with any Spark application, spark-submit is used to launch your application. spark-avro_2.11 and its dependencies can be added directly to spark-submit using --packages, such as:

./bin/spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 ...

For experimenting on spark-shell, you can also use --packages to add org.apache.spark:spark-avro_2.11 and its dependencies directly:

./bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0 ...
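The same applies to PySpark, which is what the question uses: pass the package to ./bin/pyspark or spark-submit with --packages, or set it on the SparkSession before the session starts. Below is a minimal sketch; the app name and sample data are made up for illustration, and the package coordinates assume the Spark 2.4.0 / Scala 2.11 build:

from pyspark.sql import SparkSession

# spark.jars.packages must be set before the JVM starts; if a session
# already exists, getOrCreate() returns it and this setting is ignored.
spark = (
    SparkSession.builder
    .appName("avro-write-example")  # hypothetical app name
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.11:2.4.0")
    .getOrCreate()
)

# Hypothetical sample data, just to have something to write.
df = spark.createDataFrame([("Alice", "red")], ["name", "favorite_color"])

# With the module on the classpath, format("avro") now resolves.
df.write.mode("overwrite").format("avro").save("filename.avro")

Equivalently, launching the shell with ./bin/pyspark --packages org.apache.spark:spark-avro_2.11:2.4.0 makes the original write call from the question work unchanged.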
