

Do I need to install Hadoop in order to use all aspects of Pyspark?

I've installed pyspark, but have not installed any Hadoop or Spark version separately.

Apparently, under Windows pyspark needs access to Hadoop's winutils.exe for some operations (e.g. writing files to disk). When pyspark wants to access winutils.exe, it looks for it in the bin directory of the folder specified by the HADOOP_HOME environment variable (user variable). Therefore I copied winutils.exe into pyspark's bin directory ( .\site-packages\pyspark\bin ) and set HADOOP_HOME to .\site-packages\pyspark\ . This solved the problem of getting the error message: Failed to locate the winutils binary in the hadoop binary path .
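For reference, the same setup can also be done from inside a Python script by setting the environment variables before the SparkSession is created. This is only a minimal sketch; the path is a placeholder for wherever your site-packages\pyspark folder (with winutils.exe in its bin subdirectory) actually lives:

import os

# Placeholder path: point HADOOP_HOME at a folder whose bin subdirectory contains winutils.exe
os.environ["HADOOP_HOME"] = r"C:\path\to\site-packages\pyspark"
os.environ["PATH"] += os.pathsep + os.path.join(os.environ["HADOOP_HOME"], "bin")

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("winutils-check")
    .getOrCreate()
)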

However, when I start a Spark session using pyspark I still get the following warning:

WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

Installing Hadoop and then pointing HADOOP_HOME at its installation directory did prevent the warning. Does a specific Hadoop version have to be installed to make pyspark work without restrictions?

Hadoop installation is not mandatory.

Spark is a distributed computing engine only.

Spark offers only computation; it does not provide any storage of its own. However, Spark integrates with a huge variety of storage systems such as HDFS, Cassandra, HBase, MongoDB, the local file system, etc.
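As a rough illustration (the file paths and HDFS host below are made-up examples, not part of the original answer), the same DataFrame API is used no matter which storage system backs the data; only the path/URI changes:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("storage-demo").getOrCreate()

# Local file system: works without any Hadoop cluster
# (on Windows, winutils.exe is still needed for file operations).
df_local = spark.read.csv("file:///C:/tmp/input.csv", header=True)

# HDFS: only works if a Hadoop/HDFS cluster is actually available.
# df_hdfs = spark.read.csv("hdfs://namenode:8020/data/input.csv", header=True)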

Spark is designed to run on top of a variety of resource management platforms such as Spark Standalone, Mesos, YARN, local mode, Kubernetes, etc.

PySpark is the Python API on top of Spark for developing Spark applications in Python. So a Hadoop installation is not mandatory.

Note: a Hadoop installation is only required either to run a Pyspark application on top of YARN, or to read/write the input/output of a Pyspark application from/to HDFS/Hive/HBase, or both.
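A minimal sketch of that distinction (the cluster address and output paths are assumptions for illustration only): a purely local session needs no Hadoop installation, while the commented-out variants would require one.

from pyspark.sql import SparkSession

# Local mode: only the pyspark package (plus winutils.exe on Windows) is needed.
spark = SparkSession.builder.master("local[*]").appName("no-hadoop").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").parquet("file:///C:/tmp/demo_parquet")

# These would require a Hadoop installation / a running cluster:
# SparkSession.builder.master("yarn") ...            # run on top of YARN
# df.write.parquet("hdfs://namenode:8020/tmp/demo")  # write to HDFS

spark.stop()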

The warning you posted is a normal one, so you can ignore it.
