
How to access an XML file in Azure Data Lake Gen2 and transform it into a DataFrame in Azure Databricks?

We need to access an XML file located in Azure Data Lake Gen2 and transform it into a DataFrame as shown below.

Sample XML data:

<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
    <SOAP-ENV:Body>
        <ns2:getProjectsResponse
            xmlns:ns2="http://www.logic8.com/eq/webservices/generated">
            <ns2:Project>
                <ns2:fileName>P10001</ns2:fileName>
                <ns2:alias>project1</ns2:alias>
            </ns2:Project>
            <ns2:Project>
                <ns2:fileName>P10002</ns2:fileName>
                <ns2:alias>project2</ns2:alias>
            </ns2:Project>
            <ns2:Project>
                <ns2:fileName>P10003</ns2:fileName>
                <ns2:alias>project3</ns2:alias>
            </ns2:Project>
        </ns2:getProjectsResponse>
    </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

Expected DataFrame output:

[Image not available]

Can anyone help me with this?

First, you need to learn how to read data from Azure Data Lake Gen2 into Azure Databricks.

There are many tutorials you can learn from:

  1. Databricks: Importing data from a Blob storage. This blog post is about importing data from Blob storage into Azure Databricks.
  2. Databricks Azure Blob Storage: This article explains how to access Azure Blob storage by mounting storage using DBFS or directly using APIs.
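As a rough illustration of the direct-access route, the sketch below builds an `abfss://` URI for a file in an ADLS Gen2 container and registers a storage account key on the Spark session. The helper names (`abfss_uri`, `configure_account_key`) and the container, account, and path values are hypothetical, and account-key authentication is only one of several options; treat this as an assumption-laden sketch rather than the method described in the tutorials above.

```python
def abfss_uri(container: str, account: str, path: str) -> str:
    """Build an abfss:// URI for a file in an ADLS Gen2 container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def configure_account_key(spark, account: str, key: str) -> None:
    """Register a storage account key on the Spark session.

    `spark` is assumed to be the SparkSession that Databricks notebooks provide.
    """
    spark.conf.set(f"fs.azure.account.key.{account}.dfs.core.windows.net", key)


# Hypothetical names; replace with your own container, account, and path.
xml_path = abfss_uri("mycontainer", "mystorageaccount", "xmls/projects.xml")
print(xml_path)
```

In a notebook, `configure_account_key(spark, "mystorageaccount", key)` would be called once before reading from `xml_path`; the key itself would normally come from a secret store rather than being written into the notebook.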

Second, for the XML data type, you need to use the Databricks spark-xml library, which @Axel R mentioned in a comment.

  1. Import the spark-xml library into your workspace: https://docs.databricks.com/user-guide/libraries.html#create-a-library (search for spark-xml in the Maven/Spark package section and import it).
  2. Attach the library to your cluster: https://docs.databricks.com/user-guide/libraries.html#attach-a-library-to-a-cluster
  3. Use the following code in your notebook to read the XML file, where "note" is the root element of the XML file and is passed as the row tag (spark-xml uses rowTag when reading; rootTag applies only when writing).

xmldata = spark.read.format('xml').option("rowTag", "note").load('dbfs:/mnt/mydatafolder/xmls/note.xml')
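Applied to the sample XML in the question, the same kind of read might look like the sketch below. It assumes the spark-xml library is attached to the cluster, that `spark` is the notebook's session, and that each `ns2:Project` element should become one row (spark-xml's `rowTag` option selects that element). The function name `read_projects` and the path in the usage comment are hypothetical, and this has not been run against a real cluster.

```python
def read_projects(spark, path: str):
    """Read one row per <ns2:Project> element using spark-xml (assumed installed)."""
    df = (
        spark.read.format("xml")
        .option("rowTag", "ns2:Project")  # each Project element becomes a row
        .load(path)
    )
    # Column names are expected to keep the namespace prefix; rename for convenience.
    return (
        df.withColumnRenamed("ns2:fileName", "fileName")
        .withColumnRenamed("ns2:alias", "alias")
    )


# Usage in a Databricks notebook (hypothetical path):
# projects = read_projects(spark, "dbfs:/mnt/mydatafolder/xmls/projects.xml")
# projects.show()
```

`withColumnRenamed` is a no-op when the source column does not exist, so the renames are harmless if the inferred column names turn out to differ.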

Please refer to: How can I read a XML file Azure Databricks Spark.

Combining these documents, I think you can figure out your problem. I don't know much about Azure Databricks, and I'm sorry that I can't test this for you.

Hope this helps.
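Since none of this was tested on a cluster, one way to check which rows to expect is to parse the sample locally with Python's standard library. This swaps in `xml.etree.ElementTree` instead of spark-xml, so it only illustrates the target shape, presumably one row per `Project` with `fileName` and `alias`; `SAMPLE`, `NS_URI`, and `project_rows` are names made up for this sketch.

```python
import xml.etree.ElementTree as ET

# The sample XML from the question, condensed onto fewer lines.
SAMPLE = """<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>
    <ns2:getProjectsResponse xmlns:ns2="http://www.logic8.com/eq/webservices/generated">
      <ns2:Project><ns2:fileName>P10001</ns2:fileName><ns2:alias>project1</ns2:alias></ns2:Project>
      <ns2:Project><ns2:fileName>P10002</ns2:fileName><ns2:alias>project2</ns2:alias></ns2:Project>
      <ns2:Project><ns2:fileName>P10003</ns2:fileName><ns2:alias>project3</ns2:alias></ns2:Project>
    </ns2:getProjectsResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

NS_URI = "http://www.logic8.com/eq/webservices/generated"
NS = {"ns2": NS_URI}


def project_rows(xml_text: str):
    """Return one (fileName, alias) tuple per ns2:Project element."""
    root = ET.fromstring(xml_text)
    return [
        (
            project.findtext("ns2:fileName", namespaces=NS),
            project.findtext("ns2:alias", namespaces=NS),
        )
        for project in root.iter(f"{{{NS_URI}}}Project")
    ]


rows = project_rows(SAMPLE)
print(rows)
# [('P10001', 'project1'), ('P10002', 'project2'), ('P10003', 'project3')]
```

For small files, a list like `rows` could also be turned into a DataFrame in the notebook with `spark.createDataFrame(rows, ["fileName", "alias"])`, though that bypasses distributed XML parsing.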
