Saving a DataFrame to a JSON file on the local drive in PySpark

Question

I have a dataframe that I am trying to save as a JSON file using pyspark 1.4, but it doesn't seem to be working. When I give it the path to the directory, it returns an error stating that it already exists. My assumption, based on the documentation, was that it would save a JSON file in the path that you give it.

df.write.json("C:\Users\username")

Specifying a directory with a name doesn't produce any file and gives an error of "java.io.IOException: Mkdirs failed to create file:/C:Users/username/test/_temporary/....etc". It does, however, create a directory named test, which contains several sub-directories with blank crc files.

df.write.json("C:\Users\username\test")

Adding a file extension of JSON produces the same error:

df.write.json("C:\Users\username\test.JSON")

Answer 1

Could you not just use

df.toJSON()

as shown here? If not, then first transform it into a pandas DataFrame and then write that to JSON.

pandas_df = df.toPandas()
# Raw string so the backslashes in the Windows path are kept literally.
pandas_df.to_json(r"C:\Users\username\test.JSON")
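
A side note on the Windows paths used in this thread: in a regular Python string literal, sequences such as `\t` are escape codes, so `\test.JSON` is read as a tab followed by `est.JSON` (and in Python 3, the `\U` in `C:\Users` is a syntax error). A short sketch of the difference, using a shortened path fragment chosen only for illustration:

```python
# In a normal literal "\t" is a tab character, so the path is silently changed.
escaped = "username\test.JSON"
# A raw string keeps every backslash as written.
raw = r"username\test.JSON"
# Doubling the backslashes gives the same result as the raw string.
doubled = "username\\test.JSON"

print("\t" in escaped)   # True
print(raw == doubled)    # True
```

Either the raw-string or the doubled-backslash form should be safe to pass to a function that expects a local Windows path.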

Answer 2

When working with large data, converting a PySpark DataFrame to pandas is not advisable. You can use the command below to save the JSON output in an output directory. Here df is a pyspark.sql.dataframe.DataFrame. The part file will be generated inside the output directory by the cluster.

df.coalesce(1).write.format('json').save('/your_path/output_directory')
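
Spark writes to a directory rather than to a single file, so after the `coalesce(1)` write above, the JSON ends up in a `part-*` file inside the output directory. A minimal sketch of pulling that file out to a single named path, assuming the output directory is on the local filesystem and holds exactly one part file; `collect_part_file` is a hypothetical helper, not part of PySpark:

```python
import glob
import os
import shutil


def collect_part_file(output_dir, target_path):
    """Copy the single part file found in output_dir to target_path."""
    # The pattern only matches names that start with "part-", so _SUCCESS
    # and the hidden .part-*.crc checksum files are skipped.
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    if len(parts) != 1:
        raise ValueError("expected exactly one part file, found %d" % len(parts))
    shutil.copyfile(parts[0], target_path)
    return target_path
```

It could be called as `collect_part_file('/your_path/output_directory', '/your_path/result.json')`, where the second path is a made-up target name. On a real cluster the output directory may live on HDFS or another remote store, in which case this local-filesystem approach would not apply.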

Answer 3

I would avoid using write.json since it's causing problems on Windows. Using Python's file writing should skip creating the temp directories that are giving you issues.

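# df.toJSON() returns an RDD of JSON strings, so collect them on the driver first.
with open("C:\\Users\\username\\test.json", "w") as output_file:
    output_file.write("\n".join(df.toJSON().collect()))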
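
For reference, `df.toJSON()` yields one JSON string per row rather than a single string, so the rows have to be brought back to the driver (for example with `collect()`) before a plain Python file write can use them. A minimal sketch of that write step, assuming the data is small enough to fit in driver memory; `write_json_lines` is a hypothetical helper, and the `rows` list stands in for `df.toJSON().collect()`:

```python
import json
import os
import tempfile


def write_json_lines(json_rows, path):
    """Write an iterable of JSON strings to path, one document per line."""
    count = 0
    with open(path, "w") as output_file:
        for row in json_rows:
            output_file.write(row + "\n")
            count += 1
    return count


# Stand-in for df.toJSON().collect(): one JSON string per DataFrame row.
rows = [json.dumps({"id": 1, "name": "a"}), json.dumps({"id": 2, "name": "b"})]
target = os.path.join(tempfile.gettempdir(), "test.json")
write_json_lines(rows, target)
```

The result is a JSON Lines file (one object per line), which is the same layout Spark's own JSON writer produces, so it should be readable back with Spark's JSON reader.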


Answer 1: score 4, accepted, 2015-06-29 14:39:43
Answer 2: score 1, 2019-01-10 15:54:59
Answer 3: score 0, 2015-06-29 14:16:59
