
How to open a parquet file in HDFS with Python?

I am looking to read a parquet file that is stored in HDFS, and I am using Python to do this. I have the code below, but it does not open the files in HDFS. Can you help me change the code to do this?

sc = spark.sparkContext

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.parquet('path-to-file/commentClusters.parquet')

Also, I am looking to save the DataFrame as a CSV file as well.

Have a try with:

sqlContext.read.parquet("hdfs://<host:port>/path-to-file/commentClusters.parquet")

To find out the host and port, just search for the file core-site.xml and look for the XML element fs.defaultFS (e.g. $HADOOP_HOME/etc/hadoop/core-site.xml).
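If you'd rather not read core-site.xml by hand, the lookup can be scripted with the Python standard library. A minimal sketch; the file contents and the hdfs://namenode:8020 value here are made-up examples, not your cluster's actual configuration:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Example core-site.xml like the one under $HADOOP_HOME/etc/hadoop.
# The namenode host and port below are hypothetical.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:8020</value>
  </property>
</configuration>
"""

def default_fs(core_site_path):
    """Return the value of the fs.defaultFS property, or None if absent."""
    root = ET.parse(core_site_path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == "fs.defaultFS":
            return prop.findtext("value")
    return None

if __name__ == "__main__":
    # Write the sample file to a temp dir and look up the property.
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "core-site.xml")
        with open(path, "w") as f:
            f.write(SAMPLE)
        print(default_fs(path))  # hdfs://namenode:8020
```

On a real machine you would point `default_fs()` at `$HADOOP_HOME/etc/hadoop/core-site.xml` and paste the returned value into the read path.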

To make it simple, try:

sqlContext.read.parquet("hdfs:////path-to-file/commentClusters.parquet")

or

sqlContext.read.parquet("hdfs:/path-to-file/commentClusters.parquet")

See also: Cannot Read a file from HDFS using Spark.

To save as CSV, try:

df_result.write.csv(path=res_path) # possible options: header=True, compression='gzip'
