Do I need to install Hadoop in order to use all aspects of Pyspark?
I've installed pyspark, but have not installed any Hadoop or Spark version separately.
Apparently, under Windows pyspark needs access to Hadoop's winutils.exe for some operations (e.g. writing files to disk). When pyspark wants to access winutils.exe, it looks for it in the bin directory of the folder specified by the HADOOP_HOME environment variable (user variable). Therefore I copied winutils.exe into pyspark's bin directory (.\\site-packages\\pyspark\\bin) and set HADOOP_HOME to .\\site-packages\\pyspark\\. This got rid of the error message:
Failed to locate the winutils binary in the hadoop binary path
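For reference, here is a minimal sketch of that setup done from Python before the JVM starts; the install path is a placeholder for wherever the pyspark package (and the copied winutils.exe) actually lives on your machine.

```python
import os
from pyspark.sql import SparkSession

# Hypothetical layout: winutils.exe was copied into <pyspark install>\bin,
# so HADOOP_HOME must point at the parent of that bin directory.
os.environ["HADOOP_HOME"] = r"C:\path\to\site-packages\pyspark"
os.environ["PATH"] += os.pathsep + os.path.join(os.environ["HADOOP_HOME"], "bin")

spark = (
    SparkSession.builder
    .master("local[*]")           # pure local mode, no cluster manager
    .appName("winutils-check")
    .getOrCreate()
)

# Writing to the local file system is one of the operations that needs winutils on Windows.
spark.range(5).write.mode("overwrite").csv("out_dir")
spark.stop()
```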
However, when I start a Spark session using pyspark, I still get the following warning:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Installing Hadoop and then pointing HADOOP_HOME at its installation directory did prevent the warning. Does a specific Hadoop version have to be installed to make pyspark work without restrictions?
A Hadoop installation is not mandatory.

Spark is only a distributed computing engine: it provides computation but has no storage of its own. However, it integrates with a wide variety of storage systems such as HDFS, Cassandra, HBase, MongoDB, the local file system, etc.

Spark is also designed to run on top of a variety of resource management platforms: its own standalone cluster manager, Mesos, YARN, Kubernetes, or simply local mode.

PySpark is the Python API on top of Spark for developing Spark applications in Python. So a Hadoop installation is not mandatory.
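To illustrate that point, here is a minimal local-mode sketch that never touches Hadoop; the file paths and the column name are placeholders, not part of the original question.

```python
from pyspark.sql import SparkSession

# local[*] runs Spark on the current machine's cores; no cluster manager, no HDFS.
spark = SparkSession.builder.master("local[*]").appName("no-hadoop-demo").getOrCreate()

# Read from and write to the plain local file system (placeholder paths;
# assumes the CSV has a "region" column).
df = spark.read.csv("file:///tmp/sales.csv", header=True, inferSchema=True)
df.groupBy("region").count().write.mode("overwrite").parquet("file:///tmp/sales_by_region")

spark.stop()
```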
Note: a Hadoop installation is only required either to run a PySpark application on top of YARN, or to read/write a PySpark application's input/output from/to HDFS/Hive/HBase, or both.
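As a rough sketch of those two cases: running on YARN and reading/writing HDFS both presuppose an existing Hadoop cluster (and, for the YARN master, that HADOOP_CONF_DIR points at its configuration). The host name and paths below are placeholders for illustration only.

```python
from pyspark.sql import SparkSession

# Requires a Hadoop/YARN cluster and HADOOP_CONF_DIR pointing at its config files.
spark = (
    SparkSession.builder
    .master("yarn")               # run on YARN instead of local mode
    .appName("hadoop-backed-job")
    .getOrCreate()
)

# hdfs://namenode:8020/... is a placeholder HDFS URI used only for illustration.
df = spark.read.parquet("hdfs://namenode:8020/data/events")
df.write.mode("overwrite").parquet("hdfs://namenode:8020/data/events_copy")

spark.stop()
```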
The warning you posted is a normal one, so you can safely ignore it.