简体   繁体   中英

Extract specific XML tags Values in python

I have a XML file which contains tags like these.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<DataFlows>
    <DataFlow id="ABC">
            <Flow name="flow4" type="Ingest">
                <Ingest dataSourceName="type1" tableName="table1">
                    <DataSet>
                        <DataSetRef>value1-${d1}-${t1}</DataSetRef>
                        <DataStore>ingest</DataStore>
                    </DataSet>
                    <Mode>Overwrite</Mode>
                </Ingest>
            </Flow>    
        </DataFlow>
        <DataFlow id="MHH" dependsOn="ABC">
            <Flow name="flow5" type="Reconcile">
                <Reconciliation>
                    <Source>QW</Source>
                    <Target>EF</Target>
                    <ComparisonKey>
                        <Column>dealNumber</Column>
                    </ComparisonKey>
    <ReconcileColumns mode="required">
                        <Column>bookId</Column>
                    </ReconcileColumns>
                </Reconciliation>
            </Flow>
            <Flow name="output" type="Export" format="Native">
                <Table publishToSQLServer="true">
                    <DataSet>
                        <DataSetRef>value4_${cob}_${ts}</DataSetRef>
                        <DataStore>recon</DataStore>
                        <Date>${run_date}</Date>
                    </DataSet>
                    <Mode>Overwrite</Mode>
                </Table>
            </Flow>
        </DataFlow>
</DataFlows>

I want to process this XML in python using Python Minimal DOM implementation. I need to extract information in DataSet Tag only when the Flow type in “Reconcile".

For Example:

If my Flow Type is "Reconcile" then i need to go to next Flow tag named "output" and extract values of DataSetRef,DataSource and Date tags.

So far i have tried below mentioned Code but i am getting blank values in all may fields.

#!/usr/bin/python

from xml.dom.minidom import parse

import xml.dom.minidom

# Open XML document using minidom parser

DOMTree = xml.dom.minidom.parse("Store.xml")

collection = DOMTree.documentElement

#if collection.hasAttribute("DataFlows"):

#   print "Root element : %s" % collection.getAttribute("DataFlows")

pretty = DOMTree.toprettyxml()

print "Collectio: %s" % collection

dataflows = DOMTree.getElementsByTagName("DataFlow")

# Print detail of each movie.

for dataflow in dataflows:

   print "*****dataflow*****"

   if dataflow.hasAttribute("dependsOn"):

      print "Depends On is present"

      flows = DOMTree.getElementsByTagName("Flow")

      print "flows"

      for flow in flows:

        print "******flow******"

        if flow.hasAttribute("type") and flow.getAttribute("type") == "Reconcile":

          flowByReconcileType = flow.getAttribute("type")

          TagValue = flow.getElementsByTagName("DataSet")

          print "Tag Value is %s" % TagValue

          print "flow type is: %s" % flowByReconcileType

From there onwards i need to pass these 3 values extracted above to Unix Shell scripts to process some directories. Any Help would be appreciated.

First of all check if your XML is well formatted. You are missing a root tag and you got wrong double quotes for example here <Flow name=“flow4" type="Ingest">

IN your code you are correctly grabbing the dataflows.

You don't need to query the DOMTree again for the flows, you can check every dataflow's flow by querying like this:

flows = dataflow.getElementsByTagName("Flow")

Your condition if flow.hasAttribute("type") and flow.getAttribute("type") == "Reconcile": looks ok to me, so in order to get the next flow item you can do something like this always checking your index is inside the array.

for index, flow in enumerate(flows):
    if flow.hasAttribute("type") and flow.getAttribute("type") == "Reconcile":
        if index + 1 < len(flows):
            your_flow = flows[index + 1]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM