
How do you read avros in jupyter notebook? (Pyspark)

I haven't been able to read avros inside Jupyter Notebook. When I use these commands:

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
path = "C:/Users/hp/avrofile/"
x = spark.read.format("com.databricks.spark.avro").load(path)

I get this huge error:

> --------------------------------------------------------------------------- Py4JJavaError                             Traceback (most recent call
> last) <ipython-input-6-16978c1d2487> in <module>
>       1 path = "C:/Users/hp/avrofile/"
> ----> 2 x = spark.read.format("com.databricks.spark.avro").load(path)
> 
> c:\users\hp\appdata\local\programs\python\python37\lib\site-packages\pyspark\sql\readwriter.py
> in load(self, path, format, schema, **options)
>     164         self.options(**options)
>     165         if isinstance(path, basestring):
> --> 166             return self._df(self._jreader.load(path))
>     167         elif path is not None:
>     168             if type(path) != list:
> 
> c:\users\hp\appdata\local\programs\python\python37\lib\site-packages\py4j\java_gateway.py
> in __call__(self, *args)    1255         answer =
> self.gateway_client.send_command(command)    1256         return_value
> = get_return_value(
> -> 1257             answer, self.gateway_client, self.target_id, self.name)    1258     1259         for temp_arg in temp_args:
> 
> c:\users\hp\appdata\local\programs\python\python37\lib\site-packages\pyspark\sql\utils.py
> in deco(*a, **kw)
>      61     def deco(*a, **kw):
>      62         try:
> ---> 63             return f(*a, **kw)
>      64         except py4j.protocol.Py4JJavaError as e:
>      65             s = e.java_exception.toString()
> 
> c:\users\hp\appdata\local\programs\python\python37\lib\site-packages\py4j\protocol.py
> in get_return_value(answer, gateway_client, target_id, name)
>     326                 raise Py4JJavaError(
>     327                     "An error occurred while calling {0}{1}{2}.\n".
> --> 328                     format(target_id, ".", name), value)
>     329             else:
>     330                 raise Py4JError(
> 
> Py4JJavaError: An error occurred while calling o62.load. :
> java.lang.ClassNotFoundException: Failed to find data source:
> org.apache.spark.sql.avro.AvroFileFormat. Please find packages at
> http://spark.apache.org/third-party-projects.html     at
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
>   at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
>   at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)     at
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)  at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)    at
> py4j.Gateway.invoke(Gateway.java:282)     at
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)   at
> py4j.GatewayConnection.run(GatewayConnection.java:238)    at
> java.lang.Thread.run(Thread.java:748) Caused by:
> java.lang.ClassNotFoundException:
> org.apache.spark.sql.avro.AvroFileFormat.DefaultSource    at
> java.net.URLClassLoader.findClass(URLClassLoader.java:382)    at
> java.lang.ClassLoader.loadClass(ClassLoader.java:424)     at
> java.lang.ClassLoader.loadClass(ClassLoader.java:357)     at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
>   at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
>   at scala.util.Try$.apply(Try.scala:192)     at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
>   at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
>   at scala.util.Try.orElse(Try.scala:84)  at
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
>   ... 13 more

Now, you see, I've realized that when I start Pyspark from the cmd window with this command:

pyspark --packages org.apache.spark:spark-avro_2.11:2.4.0

I can read avros with no problem:

x = spark.read.format("avro").load("C:\\Users\\avrofile\\")
x.show(5)

The thing is, in Jupyter Notebook, what's the equivalent of starting Spark with the command "pyspark --packages org.apache.spark:spark-avro_2.11:2.4.0"? I feel like this is an extremely noobish question but, I'm sorry, I'm super new to this.

Thank you so much.

Check if this solution works for you:

  • Download the required jar spark-avro_2.11-3.2.0.jar and place it in a convenient folder. Here I am using c:\users\hp\spark-avro_2.11-3.2.0.jar as an example location.

import os
# Set this BEFORE the SparkSession is created; the trailing 'pyspark-shell'
# token is required, and the path needs escaped backslashes (plain '\u' in
# a Python string literal is a syntax error).
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars C:\\Users\\hp\\spark-avro_2.11-3.2.0.jar pyspark-shell'
x = spark.read.format("avro").load("C:\\Users\\avrofile\\")
x.show(5)
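Alternatively, a direct equivalent of the working cmd invocation (a sketch, assuming pyspark and Java are already installed, and using the same Maven coordinate as in the question) is to set PYSPARK_SUBMIT_ARGS with --packages before the first SparkSession is created in the kernel:

```python
import os

# --packages pulls the Avro data source from Maven, exactly like the working
# cmd invocation; PySpark requires 'pyspark-shell' as the final token.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
)

# This only takes effect if it runs before the JVM starts, i.e. before the
# first SparkSession in the notebook kernel. After a kernel restart, run
# this cell first, then:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   x = spark.read.format("avro").load("C:/Users/hp/avrofile/")
#   x.show(5)
```

Note that restarting the kernel before running this cell matters: once a SparkSession exists, changing PYSPARK_SUBMIT_ARGS has no effect. On Spark 2.4+, passing the same coordinate via builder.config("spark.jars.packages", "org.apache.spark:spark-avro_2.11:2.4.0") before getOrCreate() achieves the same thing without the environment variable.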
