
Read avro files in pyspark with PyCharm

I'm quite new to Spark. I've added the pyspark library to my PyCharm venv and written the code below:

# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", 5)
path = "file_path"
df = spark.read.format("avro").load(path)

Everything seems to be okay, but when I try to read an Avro file I get this message:

pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'

When I go to this page, https://spark.apache.org/docs/latest/sql-data-sources-avro.html, there is something like this:

[image]

and I have no idea how to implement this. Do I have to download something in PyCharm, or find external files to modify?

Thank you for your help!

Update (2019-12-06): Because I'm using Anaconda, I opened the Anaconda prompt and pasted in this command:

pyspark --packages com.databricks:spark-avro_2.11:4.0.0

It downloaded some modules, but when I went back to PyCharm the same error appeared.

I installed the pyspark 2.4.4 package from conda in PyCharm, added the spark-avro_2.11-2.4.4.jar file to the Spark configuration, and was able to successfully reproduce your error, i.e. pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'

To fix this issue, follow the steps below:

  1. Uninstall the pyspark package installed from conda.
  2. Download and unzip spark-2.4.4-bin-hadoop2.7.tgz.
  3. In Run > Environment Variables, set SPARK_HOME to <download_path>/spark-2.4.4-bin-hadoop2.7 and set PYTHONPATH to $SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python
  4. Download the spark-avro_2.11-2.4.4.jar file.
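
The two environment variables from step 3 can also be derived in Python, which may help when checking what PyCharm should be given. This is only a minimal sketch: `spark_env` and `apply_spark_env` are hypothetical helper names, the py4j file name is the one quoted in step 3 and differs between Spark releases, and `os.pathsep` is used so the separator is `:` on Linux/macOS and `;` on Windows.

```python
import os
import sys


def spark_env(spark_home, py4j_zip="py4j-0.10.7-src.zip"):
    """Return the SPARK_HOME and PYTHONPATH values described in step 3.

    py4j_zip is an assumption taken from the answer; check the actual
    file name under <spark_home>/python/lib for your Spark release.
    """
    python_dir = os.path.join(spark_home, "python")
    py4j_path = os.path.join(python_dir, "lib", py4j_zip)
    return {
        "SPARK_HOME": spark_home,
        "PYTHONPATH": os.pathsep.join([py4j_path, python_dir]),
    }


def apply_spark_env(spark_home):
    """Apply the settings to the running interpreter (hypothetical helper)."""
    env = spark_env(spark_home)
    os.environ["SPARK_HOME"] = env["SPARK_HOME"]
    # Changing PYTHONPATH after startup has no effect on the current process,
    # so the same entries are prepended to sys.path directly.
    for entry in reversed(env["PYTHONPATH"].split(os.pathsep)):
        if entry not in sys.path:
            sys.path.insert(0, entry)
    return env
```

Calling `apply_spark_env("<download_path>/<spark_folder>")` before `import pyspark` would have roughly the same effect as the Run configuration, but setting the variables in PyCharm keeps the script itself free of machine-specific paths.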

Now you should be able to run pyspark code from PyCharm. Try the code below:

# Imports
from pyspark.sql import SparkSession

# Create SparkSession and register the downloaded spark-avro JAR
spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .config('spark.jars', '<path>/spark-avro_2.11-2.4.4.jar') \
    .getOrCreate()


df = spark.read.format('avro').load('<path>/userdata1.avro')

df.show()

The above code will work, but PyCharm will complain about the pyspark modules. To remove the warnings and enable code completion, follow this additional step:

  1. In Project Structure, click Add Content Root and add spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip

Now your project structure should look like:

[image]

Output: [image]
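
As a possible alternative to downloading the JAR by hand, Spark's `spark.jars.packages` setting accepts Maven coordinates and fetches the artifact at startup. The sketch below is not part of the answer above: `avro_coordinate` and `create_avro_session` are hypothetical helper names, the default versions are the ones used in this answer, and it assumes the machine can reach a Maven repository and that no Spark JVM is already running in the process.

```python
def avro_coordinate(spark_version, scala_version):
    """Build the Maven coordinate of the spark-avro module.

    Both versions must match the Spark distribution in use,
    e.g. Spark 2.4.4 built for Scala 2.11.
    """
    return "org.apache.spark:spark-avro_{}:{}".format(scala_version, spark_version)


def create_avro_session(spark_version="2.4.4", scala_version="2.11"):
    """Create a local SparkSession that pulls spark-avro from Maven."""
    # Imported lazily so avro_coordinate() works without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("DataFrame")
        .master("local[*]")
        .config("spark.jars.packages", avro_coordinate(spark_version, scala_version))
        .getOrCreate()
    )
```

With this in place, `create_avro_session().read.format("avro").load("<path>/userdata1.avro")` would take the place of the `spark.jars` line. The trade-off is a network download on first run instead of a local file path.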

pyspark --jars /<path_to>/spark-avro_<version>.jar
Spark 3.0.2 worked for me.

A simple solution is to submit the script from the Terminal tab inside PyCharm with the spark-submit command, as shown below.

General syntax of the command:

spark-submit --packages <package_name> <script_path>

Because Avro support is needed, the com.databricks:spark-avro_2.11:4.0.0 package should be included, so the final command is:

spark-submit --packages com.databricks:spark-avro_2.11:4.0.0 <script_path>
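
When the script is started with PyCharm's Run button rather than from the terminal, the same `--packages` option can usually be handed to PySpark through the `PYSPARK_SUBMIT_ARGS` environment variable. This is a sketch under that assumption, not something the answer above describes: `pyspark_submit_args` and `configure_packages` are hypothetical helper names, and the variable has to be set before the first SparkSession is created.

```python
import os


def pyspark_submit_args(package):
    """Build a PYSPARK_SUBMIT_ARGS value equivalent to `--packages <package>`."""
    # PySpark expects the value to end with the "pyspark-shell" token.
    return "--packages {} pyspark-shell".format(package)


def configure_packages(package):
    """Set the variable for this process; call before creating a SparkSession."""
    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args(package)
    return os.environ["PYSPARK_SUBMIT_ARGS"]
```

For example, `configure_packages("com.databricks:spark-avro_2.11:4.0.0")` at the top of the script mirrors the spark-submit command above. The same string could instead be entered in the Run configuration's environment variables.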

Your Spark version and your Avro JAR version should be in sync.
For example, if you're using Spark 3.1.2, your Avro JAR should be spark-avro_2.12-3.1.2.jar.
Sample code:

from pyspark.sql import SparkSession

# The spark-avro JAR version must match the installed Spark version
spark = SparkSession.builder.appName('DataFrame') \
    .config('spark.jars', 'C:\\Users\\<<User_Name>>\\Downloads\\spark-avro_2.12-3.1.2.jar') \
    .getOrCreate()
df = spark.read.format('avro').load('C:\\Users\\<<user name>>\\Downloads\\sample.avro')
df.show()

Output:
+-------------------+-------+------+------------+------------+--------------------+-------+-----------+--------------------+-----------------+------------------+-------+------------+--------------+--------------+----------+--------------------+
|           datetime|country|region|publisher_id|placement_id|       impression_id|consent|   hostname|                uuid|placement_type_id|iab_device_type_id|site_id|request_type|placement_type|bid_url_domain|app_bundle|                 tps|
+-------------------+-------+------+------------+------------+--------------------+-------+-----------+--------------------+-----------------+------------------+-------+------------+--------------+--------------+----------+--------------------+
|2021-07-30 14:55:18|   null|  null|        5016|        5016|8bdf2cf1-3a17-473...|      4|test.server|9515d578-9ee0-462...|                0|                 5|   5016|      advast|         video|          null|      null|{5016 -> {5016, n...|
|2021-07-30 14:55:22|   null|  null|        2702|        2702|ab3b63d1-a916-4d7...|      4|test.server|9515d578-9ee0-462...|                1|                 2|   2702|         adi|        banner|          null|      null|{2702 -> {2702, n...|
|2021-07-30 14:55:24|   null|  null|        1106|        1106|574f078f-0fc6-452...|      4|test.server|9515d578-9ee0-462...|                1|                 2|   1106|         adi|        banner|          null|      null|{1106 -> {1106, n...|
|2021-07-30 14:55:25|   null|  null|        1107|        1107|54bf5cf8-3438-400...|      4|test.server|9515d578-9ee0-462...|                1|                 2|   1107|         adi|        banner|          null|      null|{1107 -> {1107, n...|
|2021-07-30 14:55:27|   null|  null|        4277|        4277|b3508668-3ad5-4db...|      4|test.server|9515d578-9ee0-462...|                1|                 2|   4277|         adi|        banner|          null|      null|{4277 -> {4277, n...|
+-------------------+-------+------+------------+------------+--------------------+-------+-----------+--------------------+-----------------+------------------+-------+------------+--------------+--------------+----------+--------------------+
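
The version-matching rule from this answer can be checked before the session is built. The helper below is only a sketch: `parse_avro_jar` and `jar_matches_spark` are hypothetical names, and it assumes the JAR keeps the standard `spark-avro_<scala>-<spark>.jar` file name with a three-part Spark version.

```python
import re

# Matches names such as spark-avro_2.12-3.1.2.jar at the end of a path.
JAR_PATTERN = re.compile(
    r"spark-avro_(?P<scala>\d+\.\d+)-(?P<spark>\d+\.\d+\.\d+)\.jar$"
)


def parse_avro_jar(jar_path):
    """Return (scala_version, spark_version) encoded in the JAR file name."""
    match = JAR_PATTERN.search(jar_path)
    if match is None:
        raise ValueError("not a standard spark-avro JAR name: {}".format(jar_path))
    return match.group("scala"), match.group("spark")


def jar_matches_spark(jar_path, spark_version):
    """True when the JAR was built for the given Spark version."""
    _, jar_spark = parse_avro_jar(jar_path)
    return jar_spark == spark_version
```

In a script, `jar_matches_spark(jar_path, pyspark.__version__)` could guard the `config('spark.jars', ...)` call, so that a mismatch is reported up front rather than surfacing as the "Failed to find data source: avro" error.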

