
How to access XML file from Azure Data Lake Gen2 and transform it into data-frame in Azure Databricks?

We need to access an XML file located in Azure Data Lake Gen2 and transform it into a DataFrame, as shown below.

Sample XML data:

<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
    <SOAP-ENV:Body>
        <ns2:getProjectsResponse
            xmlns:ns2="http://www.logic8.com/eq/webservices/generated">
            <ns2:Project>
                <ns2:fileName>P10001</ns2:fileName>
                <ns2:alias>project1</ns2:alias>
            </ns2:Project>
            <ns2:Project>
                <ns2:fileName>P10002</ns2:fileName>
                <ns2:alias>project2</ns2:alias>
            </ns2:Project>
            <ns2:Project>
                <ns2:fileName>P10003</ns2:fileName>
                <ns2:alias>project3</ns2:alias>
            </ns2:Project>
        </ns2:getProjectsResponse>
    </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

Expected Dataframe output:

[image]

Can anyone help me with this?

First, you need to learn how to read data from Azure Data Lake Gen2 into Azure Databricks.

There are many tutorials you can learn from:

  1. Databricks: Importing data from a Blob storage. This blog post is about importing data from Blob storage into Azure Databricks.
  2. Databricks Azure Blob Storage: This article explains how to access Azure Blob storage, either by mounting the storage using DBFS or by using the APIs directly.
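Both tutorials are about Blob storage. For Data Lake Gen2 specifically, direct access usually goes through an `abfss://` URI plus some form of authentication. The sketch below assumes account-key authentication, which is only one of several options; the helper names, container, account and path are hypothetical and not from the original post.

```python
def abfss_uri(container, account, path):
    """Build an abfss:// URI for a file in an ADLS Gen2 container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def configure_account_key(spark, account, key):
    """Register a storage account key on the Spark session (assumed auth method)."""
    spark.conf.set(f"fs.azure.account.key.{account}.dfs.core.windows.net", key)


# Hypothetical names; replace with your own container, account and path.
uri = abfss_uri("mycontainer", "mystorageaccount", "/xmls/projects.xml")
print(uri)  # abfss://mycontainer@mystorageaccount.dfs.core.windows.net/xmls/projects.xml
```

In a Databricks notebook, where `spark` is already defined, you would call `configure_account_key(spark, "mystorageaccount", key)` once and then pass `uri` to the reader. The key itself is better fetched from a secret scope than pasted into the notebook.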

Second, for the XML data itself, you need to use the Databricks spark-xml library, which @Axel R pointed to in a comment.

  1. Import the spark-xml library into your workspace: https://docs.databricks.com/user-guide/libraries.html#create-a-library (search for spark-xml in the Maven/Spark Packages section and import it).
  2. Attach the library to your cluster: https://docs.databricks.com/user-guide/libraries.html#attach-a-library-to-a-cluster
  3. Use the following code in your notebook to read the XML file, where "note" is the root element of the file and is read as a single row via rowTag (spark-xml uses rootTag only when writing XML).

xmldata = spark.read.format('xml').option("rowTag", "note").load('dbfs:/mnt/mydatafolder/xmls/note.xml')
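For the sample XML in the question, the repeating element is `ns2:Project`, so a reader along the following lines may work. This is an untested sketch: it assumes spark-xml is attached to the cluster and that the inferred column names keep their `ns2:` prefix. The function name `read_projects` is hypothetical. The `rowTag` option is what tells spark-xml which element becomes a row.

```python
def read_projects(spark, path):
    """Read each <ns2:Project> element of the SOAP response as one row."""
    df = (
        spark.read.format("xml")
        # rowTag names the repeating element that becomes a DataFrame row.
        .option("rowTag", "ns2:Project")
        .load(path)
    )
    # Assumption: columns arrive as "ns2:fileName" and "ns2:alias".
    # withColumnRenamed is a no-op when the old name is absent.
    return (
        df.withColumnRenamed("ns2:fileName", "fileName")
          .withColumnRenamed("ns2:alias", "alias")
    )
```

In a notebook you would call `read_projects(spark, path)` with the DBFS mount path or `abfss://` URI of your file and inspect the result with `display(...)`. Recent spark-xml releases also document an `ignoreNamespace` option that drops prefixes on child elements, which could replace the renaming step if your version supports it.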

For reference, see: How can I read a XML file Azure Databricks Spark.

Combining these documents, I think you can figure out your problem. I don't know much about Azure Databricks, and I'm sorry that I can't test this for you.

Hope this helps.
