How to extract all rows from a Spark DataFrame to a file using PySpark in Databricks

Question

I'm trying to write all of the rows in a Spark DataFrame to a file in Databricks, but only some of the rows end up in the file. For example, if the DataFrame's count is 100, the file contains only 50 rows, so data is being skipped. How can I write the complete DataFrame to a file without losing rows? I have created a UDF that opens the file and appends data to it, and I call that UDF from a Spark SQL DataFrame.

Can someone help me with this issue?

Answer 1

I would advise against using a UDF the way you are, for a few reasons:

UDFs run on the worker nodes, so you would have multiple UDF instances, each writing a portion of your data to a local file.
Even if your UDF appends to a file in a shared location (like DBFS), you still have multiple nodes writing to the same file concurrently, which could lead to errors.
Spark already has a way to do this out of the box, and you should take advantage of it.
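As a rough sketch of the built-in route, the helper below writes a DataFrame with Spark's own writer and then reads the output back to confirm that the row counts match. The function name, the "overwrite" mode, the header option, and the path in the usage comment are illustrative assumptions, not something from the question.

```python
def write_csv_and_check(spark, df, path):
    """Write df as CSV with Spark's built-in writer, then confirm no rows were lost.

    spark is a SparkSession and df is a DataFrame; both are supplied by the caller.
    """
    expected = df.count()
    # The writer runs as an ordinary Spark job, so every partition gets written.
    df.write.mode("overwrite").csv(path, header=True)
    # Read the output back and compare row counts.
    written = spark.read.csv(path, header=True).count()
    if written != expected:
        raise ValueError(f"expected {expected} rows, found {written}")
    return written

# Usage in a Databricks notebook (hypothetical DataFrame and path):
# write_csv_and_check(spark, my_df, "dbfs:/tmp/my_df_export")
```

The read-back count assumes one record per line; if string columns can contain embedded newlines, the CSV reader would need its multiLine option for the comparison to be meaningful.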

To write a Spark DataFrame to a file in Databricks, use the DataFrame.write attribute (Databricks docs). There are plenty of options, so you should be able to do whatever you need (Spark docs; this one is for CSVs).

Note on partitions: Spark writes each partition of the DataFrame to its own file, so to get a single output file you should call coalesce(1) before writing (warning: this can be very slow with extremely large DataFrames, since all of the data then has to be written by a single task instead of in parallel).

Note on file locations: Save the output to the Databricks File System (DBFS) rather than to a node's local disk, so that it is accessible from any cluster in your Databricks workspace. Paths passed to the Spark writer are resolved against DBFS by default, and the same storage is mounted at "/dbfs" on the nodes' local file systems, which is the prefix to use when reading the file back through local file APIs. (The output is also available to download using the Databricks CLI.)
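To illustrate how the two path spellings relate: Spark APIs address DBFS as dbfs:/..., while local file APIs (such as Python's open) see the same storage under /dbfs/... on clusters where that mount is available. The helper name below is hypothetical and is not part of any Databricks library.

```python
def dbfs_uri_to_local_path(uri: str) -> str:
    """Map a 'dbfs:/...' URI to the '/dbfs/...' path seen by local file APIs."""
    prefix = "dbfs:/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a dbfs:/ URI: {uri!r}")
    # Drop the scheme and re-root the remainder under the /dbfs mount point.
    return "/dbfs/" + uri[len(prefix):].lstrip("/")

print(dbfs_uri_to_local_path("dbfs:/myFileDownloads/dataframeDownload.csv"))
# prints /dbfs/myFileDownloads/dataframeDownload.csv
```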

Full Example:

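# "col_a" and "col_b" are placeholder names; list the columns you want to keep.
df_to_write = my_df.select("col_a", "col_b")
# Spark creates a directory at this path; coalesce(1) leaves a single part-*.csv file inside it.
df_to_write.coalesce(1).write.csv("/dbfs/myFileDownloads/dataframeDownload.csv")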
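Because the Spark writer produces a directory containing a part file rather than one plain file, a small post-processing step can help when a single named CSV is needed. The sketch below uses only the Python standard library and assumes the output directory is reachable through the local file system (for example under the /dbfs mount) and that the output is uncompressed, so the part file name ends in .csv. The function name and the paths in the usage comment are hypothetical.

```python
import shutil
from pathlib import Path

def promote_single_part_file(output_dir: str, target_file: str) -> str:
    """Copy the lone part-*.csv file out of a Spark output directory."""
    # Marker files such as _SUCCESS do not match this pattern and are ignored.
    parts = sorted(Path(output_dir).glob("part-*.csv"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    shutil.copyfile(parts[0], target_file)
    return target_file

# Usage (hypothetical paths):
# promote_single_part_file("/dbfs/tmp/my_df_export", "/dbfs/tmp/my_df_export.csv")
```

Raising an error when more than one part file is present is a deliberate choice here: it signals that the DataFrame was not coalesced to one partition, rather than silently copying only part of the data.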

Answer 1
0 2020-10-05 18:49:14
