
Databricks not saving dataframes as Parquet properly in the blob storage

I am using Databricks with a mounted blob storage. When I execute my Python notebook, which creates large pandas DataFrames and tries to store them as .parquet files, they show up as having 0 bytes.

The saving takes place in a submodule that I import, not in the main notebook itself. The strange thing is that saving the dataframe as a parquet file always stores it as an empty file, i.e. with 0 bytes. However, if I try to save a dataframe as a .parquet file in the main notebook itself, it works.

The problem seems to be very similar to this issue: https://community.databricks.com/s/question/0D58Y00009MIWkfSAH/how-can-i-save-a-parquet-file-using-pandas-with-a-data-factory-orchestrated-notebook

I have installed both pyarrow and pandas and try to save a dataframe as follows:

df.to_parquet("blob storage location.parquet", index=False, engine="pyarrow")

Everything works fine locally, but running this in Databricks causes issues. I first tried to save my dataframes as HDF5 files, but the saving process doesn't seem to work in Databricks. I then switched to Parquet, but I am running into the issue described above.

Does anyone have a solution or an explanation as to why this is happening?

I tried to reproduce the same in my environment and I got the below results:

This is my sample mount location path: /mnt/io243

# Mount the blob storage container to DBFS (replace the <placeholders> with your own values)
dbutils.fs.mount(
    source = "wasbs://<container_Name>@<storage_account_name>.blob.core.windows.net/",
    mount_point = "/mnt/<mount_name>",
    extra_configs = {"fs.azure.account.key.<storage_account_name>.blob.core.windows.net": "Access_key"})
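
If the container is already mounted, you can confirm the mount point before writing. This is just a quick check, assuming the standard dbutils utilities available in Databricks notebooks:

# List existing mounts to confirm the mount point (e.g. /mnt/io243) is present.
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)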


NOTE: As you can see, this is my mount path /mnt/io243. If I use the same mount path without adding /dbfs, the file is stored as an empty file. So use the mount path with this syntax: /dbfs/mnt/io243/<file_name>.parquet. Make sure to install fsspec with this command: %pip install fsspec.
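
To make the /dbfs prefix difference concrete, here is a minimal sketch (sample.parquet is just an illustrative file name):

import pandas as pd

pdf = pd.DataFrame({"id": [1, 2, 3]})

# Writing via the raw mount path is what produces the empty (0 byte) file:
# pdf.to_parquet("/mnt/io243/sample.parquet", index=False, engine="pyarrow")

# Prefixing the mount with /dbfs writes through the local FUSE mount instead:
pdf.to_parquet("/dbfs/mnt/io243/sample.parquet", index=False, engine="pyarrow")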

I successfully got the file into the destination location using the below code.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import pandas as pd
# %pip install fsspec

# sample data
my_data = [
            ("vamsi", "1", "M", 2000),
            ("saideep", "2", "M", 3000),
            ("rakesh", "3", "M", 4000)
          ]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
  ])

df = spark.createDataFrame(data=my_data, schema=schema)

# convert to pandas and write through the /dbfs FUSE path
df1 = df.toPandas()

df1.to_parquet("/dbfs/mnt/io243/def1.parquet", index=False, engine="pyarrow")
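
As a quick sanity check, you can read the file back with pandas from the same /dbfs path; a small verification sketch reusing the def1.parquet file written above:

# Read the parquet file back through the /dbfs path and confirm it contains rows.
check = pd.read_parquet("/dbfs/mnt/io243/def1.parquet", engine="pyarrow")
print(check.shape)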


Yes, you can check whether the file is stored inside the mount location. Use the following command:

dbutils.fs.ls('<mount_path>')
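
To confirm the file is no longer 0 bytes, you can also inspect the size field that dbutils.fs.ls returns; a small sketch, assuming the same /mnt/io243 mount as above:

# Each FileInfo returned by dbutils.fs.ls exposes name and size (in bytes).
for f in dbutils.fs.ls("/mnt/io243"):
    print(f.name, f.size)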

