Since Spark 2.4.0 it's possible to save as Avro without external JARs. However, I can't get it working at all. My code looks like this:
key = 'filename.avro'
df.write.mode('overwrite').format("avro").save(key)
I get the following error:
pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
So I look at the Apache Avro Data Source Guide ( https://spark.apache.org/docs/latest/sql-data-sources-avro.html ) and it gives the following example:
df=spark.read.format("avro").load("examples/src/main/resources/users.avro")
df.select("name","favorite_color").write.format("avro").save("namesAndFavColors.avro")
It is the same as mine, so I'm lost. Does anyone have an idea what's going wrong?
The documentation you've linked clearly says that:
The spark-avro module is external and not included in spark-submit or spark-shell by default.
and further explains how to include the package.
So your statement:
Since Spark 2.4.0 it's possible to save as Avro without external JARs.
is just incorrect.
The spark-avro module is external and not included in spark-submit or spark-shell by default.
As with any Spark applications, spark-submit is used to launch your application. spark-avro_2.11 and its dependencies can be directly added to spark-submit using --packages, such as,
./bin/spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 ...
For experimenting on spark-shell, you can also use --packages to add org.apache.spark:spark-avro_2.11 and its dependencies directly,
./bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0 ...
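Since your code is PySpark rather than spark-shell, you can also attach the package from within the script via the `spark.jars.packages` configuration property when building the session. A minimal sketch (the app name and file path are placeholders; the Scala/Spark version suffix `2.11:2.4.0` must match your cluster):

```python
from pyspark.sql import SparkSession

# Pull in the external spark-avro module at session startup.
# Adjust the coordinate to your Spark/Scala versions.
spark = (
    SparkSession.builder
    .appName("avro-writer")  # placeholder app name
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.11:2.4.0")
    .getOrCreate()
)

df = spark.range(5)  # any DataFrame
df.write.mode("overwrite").format("avro").save("filename.avro")
```

This is equivalent to passing `--packages` on the command line; the jar is resolved and downloaded when the session starts, so the `format("avro")` call can then find the data source.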