
Parsing a specific tag from XML in Python

This is my input file

<datasource formatted-name='federated.1819qwi0hys5391dzxhl70o95li4' inline='true' source-platform='win' version='18.1' xmlns:user='http://www.tableausoftware.com/xml/user'>
  <connection class='federated'>
    <named-connections>
      <named-connection caption='Sample - Superstore' name='excel.1ew9u4t0tggb9315darmm0nfz2kb'>
        <connection class='excel' driver='' filename='C:/Users/XXXX/Downloads/Sample - Superstore.xls' header='yes' imex='1' password='' server='' />
      </named-connection>
    </named-connections>
    <relation connection='excel.1ew9u4t0tggb9315darmm0nfz2kb' name='Custom SQL Query' type='text'>SELECT [Orders$].[Category] AS [Category],&#13;&#10;  [Orders$].[City] AS [City],&#13;&#10;  [Orders$].[Country] AS [Country],&#13;&#10;  [Orders$].[Customer ID] AS [Customer ID],&#13;&#10;  [Orders$].[Customer Name] AS [Customer Name],&#13;&#10;  [Orders$].[Discount] AS [Discount],&#13;&#10;  [Orders$].[Profit] AS [Profit],&#13;&#10;  [Orders$].[Quantity] AS [Quantity],&#13;&#10;  [Orders$].[Region] AS [Region],&#13;&#10;  [Orders$].[State] AS [State],&#13;&#10;  [People$].[Person] AS [Person],&#13;&#10;  [People$].[Region] AS [Region (People)]&#13;&#10;FROM [Orders$]&#13;&#10;  INNER JOIN [People$] ON [Orders$].[Region] = [People$].[Region]</relation>
    <metadata-records>
      <metadata-record class='column'>
        <remote-name>Category</remote-name>
        <remote-type>130</remote-type>
        <local-name>[Category]</local-name>
        <parent-name>[Custom SQL Query]</parent-name>
        <remote-alias>Category</remote-alias>
        <ordinal>1</ordinal>
        <local-type>string</local-type>
        <aggregation>Count</aggregation>
        <contains-null>true</contains-null>
        <collation>LEN_RUS_S2_WO</collation>
        <attributes>
          <attribute datatype='string' name='DebugRemoteType'>&quot;WSTR&quot;</attribute>
        </attributes>
      </metadata-record>

I want to get the attribute tag. I have tried:

for x in xmlRoot.findall('./metadata-record'):
    sqlString = x.find('attribute').text

But I am getting only whitespace as the result. I have tried every combination I could think of in findall and still cannot get the result. I want to read that attribute tag dynamically and write it to the output file unchanged. I have retrieved the other tags from metadata-record, but this one alone is not working. Can someone help?

My expected output is

<metadata-records>
      <metadata-record class='column'>
        <remote-name>Category</remote-name>
        <remote-type>130</remote-type>
        <local-name>[Category]</local-name>
        <parent-name>[Custom SQL Query]</parent-name>
        <remote-alias>Category</remote-alias>
        <ordinal>1</ordinal>
        <local-type>string</local-type>
        <aggregation>Count</aggregation>
        <contains-null>true</contains-null>
        <collation>LEN_RUS_S2_WO</collation>
        <attributes>
          <attribute datatype='string' name='DebugRemoteType'>&quot;WSTR&quot;</attribute>
        </attributes>
      </metadata-record>

I have retrieved everything up to the collation tag, but I do not know how to get the attributes tag. Can someone help?

Thanks, Aarush

Fix XML

First, I would fix the input file. It is not well-formed XML because it is missing some closing tags.

I fixed it for you here:

<datasource formatted-name='federated.1819qwi0hys5391dzxhl70o95li4' inline='true' source-platform='win' version='18.1' xmlns:user='http://www.tableausoftware.com/xml/user'>
  <connection class='federated'>
    <named-connections>
      <named-connection caption='Sample - Superstore' name='excel.1ew9u4t0tggb9315darmm0nfz2kb'>
        <connection class='excel' driver='' filename='C:/Users/XXXX/Downloads/Sample - Superstore.xls' header='yes' imex='1' password='' server='' />
      </named-connection>
    </named-connections>
    <relation connection='excel.1ew9u4t0tggb9315darmm0nfz2kb' name='Custom SQL Query' type='text'>SELECT [Orders$].[Category] AS [Category],&#13;&#10;  [Orders$].[City] AS [City],&#13;&#10;  [Orders$].[Country] AS [Country],&#13;&#10;  [Orders$].[Customer ID] AS [Customer ID],&#13;&#10;  [Orders$].[Customer Name] AS [Customer Name],&#13;&#10;  [Orders$].[Discount] AS [Discount],&#13;&#10;  [Orders$].[Profit] AS [Profit],&#13;&#10;  [Orders$].[Quantity] AS [Quantity],&#13;&#10;  [Orders$].[Region] AS [Region],&#13;&#10;  [Orders$].[State] AS [State],&#13;&#10;  [People$].[Person] AS [Person],&#13;&#10;  [People$].[Region] AS [Region (People)]&#13;&#10;FROM [Orders$]&#13;&#10;  INNER JOIN [People$] ON [Orders$].[Region] = [People$].[Region]
    </relation>
  </connection>
    <metadata-records>
      <metadata-record class='column'>
        <remote-name>Category</remote-name>
        <remote-type>130</remote-type>
        <local-name>[Category]</local-name>
        <parent-name>[Custom SQL Query]</parent-name>
        <remote-alias>Category</remote-alias>
        <ordinal>1</ordinal>
        <local-type>string</local-type>
        <aggregation>Count</aggregation>
        <contains-null>true</contains-null>
        <collation>LEN_RUS_S2_WO</collation>
        <attributes>
          <attribute datatype='string' name='DebugRemoteType'>&quot;WSTR&quot;</attribute>
        </attributes>
      </metadata-record>
    </metadata-records>
</datasource>
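
A quick way to confirm whether a file is well-formed before traversing it is to let a parser try it. The sketch below is one possible check using xml.etree.ElementTree; the helper name is_well_formed and the two shortened strings are made up for illustration and only stand in for the truncated and the repaired input.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, parser message).

    Hypothetical helper, not part of any library.
    """
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as exc:
        return False, str(exc)


# Shortened stand-ins for the real input: the first one lacks closing tags.
truncated = (
    "<datasource><metadata-records>"
    "<metadata-record class='column'></metadata-record>"
)
repaired = truncated + "</metadata-records></datasource>"

print(is_well_formed(truncated))  # first item is False, second is the parser's message
print(is_well_formed(repaired))   # (True, None)
```

If the first value is False, the message usually points at the line and column where the parser gave up, which helps locate the missing closing tag.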

Now use minidom to traverse the XML:

  1. Import the minidom module from xml.dom.
  2. Parse the XML (I saved it to my file system as x.xml).
  3. Get the element you are looking for with getElementsByTagName.

Here is my code:

from xml.dom import minidom

mydoc = minidom.parse('x.xml')

items = mydoc.getElementsByTagName('attribute')

print(items)

print(items) prints the object [<DOM Element: attribute at 0x10aad6690>]. To get the value inside, you need to look at the contents of this object, which is a NodeList. Do this to get the text between the tags:

# Traverse the childNodes of the tag
for t in items[0].childNodes:
    # if the node is a text node then print it
    if t.nodeType == t.TEXT_NODE:
        print(t.nodeValue)

One-liner:

print(''.join(t.nodeValue for t in items[0].childNodes if t.nodeType == t.TEXT_NODE))
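
If the file contains several attribute elements, the same approach can be wrapped in a loop. The following is a minimal sketch, assuming you also want the name and datatype XML attributes of each element; node_text is a hypothetical helper, and the inline string only stands in for x.xml.

```python
from xml.dom import minidom

# Shortened sample; the real file would be loaded with minidom.parse('x.xml').
sample = """<metadata-record class='column'>
  <attributes>
    <attribute datatype='string' name='DebugRemoteType'>&quot;WSTR&quot;</attribute>
  </attributes>
</metadata-record>"""

doc = minidom.parseString(sample)


def node_text(element):
    """Concatenate the text nodes directly under element (hypothetical helper)."""
    return ''.join(
        n.nodeValue for n in element.childNodes if n.nodeType == n.TEXT_NODE
    )


results = []
for el in doc.getElementsByTagName('attribute'):
    # getAttribute reads an XML attribute; node_text reads the enclosed text.
    results.append((el.getAttribute('name'), el.getAttribute('datatype'), node_text(el)))

print(results)  # [('DebugRemoteType', 'string', '"WSTR"')]
```

Note that the parser decodes &quot; to a plain double quote, so the text comes back as "WSTR" including the quotes.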

This reference page really helped me get started with XML parsing.

Using xml.etree.ElementTree, you can try something like this:

import xml.etree.ElementTree as ET

xmlRoot = ET.fromstring(xml)
print(''.join([ET.tostring(x, encoding="unicode") for x in xmlRoot.findall('.//metadata-records//*')]))

Here, xml is a string holding your XML input data.
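
One thing to watch for: ET.tostring serializes each matched element together with its whole subtree, so joining the output for every descendant repeats the nested elements. If the goal is to write the metadata-records block to a file once, a possible alternative is to serialize that single element. This is only a sketch; the shortened sample and the output file name metadata_records.xml are assumptions for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Shortened stand-in for the real input.
sample = """<datasource>
  <metadata-records>
    <metadata-record class='column'>
      <attributes>
        <attribute datatype='string' name='DebugRemoteType'>&quot;WSTR&quot;</attribute>
      </attributes>
    </metadata-record>
  </metadata-records>
</datasource>"""

root = ET.fromstring(sample)

# Serializing every descendant separately: nested elements show up repeatedly.
joined = ''.join(
    ET.tostring(x, encoding='unicode')
    for x in root.findall('.//metadata-records//*')
)

# Serializing the one container element: each nested element shows up once.
records = root.find('.//metadata-records')
single = ET.tostring(records, encoding='unicode')

print(joined.count('<attribute '))  # 3
print(single.count('<attribute '))  # 1

# Hypothetical output location; adjust the path to your needs.
out_path = os.path.join(tempfile.gettempdir(), 'metadata_records.xml')
with open(out_path, 'w', encoding='utf-8') as fh:
    fh.write(single)
```

The serialized text is equivalent XML but not necessarily byte-for-byte identical to the input: ElementTree writes attribute values in double quotes and emits the decoded quote characters in the text instead of &quot;.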

The key is the findall call: starting from the root, it looks for any descendant element named metadata-records and then matches every element beneath it.

The double forward slash // makes sure that not only direct children are matched, but every descendant of the metadata-records element. That is why you found the <attributes> element (a direct child of metadata-record, whose text is only whitespace) but failed to find the <attribute> element (a child of that child).
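
To make the difference between child and descendant lookups concrete, here is a small sketch against a shortened copy of the input (the sample string is trimmed for illustration). It compares find with a bare tag name, with an explicit path, and with a descendant search.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the real input.
sample = """<datasource>
  <metadata-records>
    <metadata-record class='column'>
      <collation>LEN_RUS_S2_WO</collation>
      <attributes>
        <attribute datatype='string' name='DebugRemoteType'>&quot;WSTR&quot;</attribute>
      </attributes>
    </metadata-record>
  </metadata-records>
</datasource>"""

root = ET.fromstring(sample)
record = root.find('.//metadata-record')

direct = record.find('attribute')             # None: attribute is not a direct child
wrapper = record.find('attributes')           # direct child, text is only whitespace
nested = record.find('attributes/attribute')  # explicit child-of-child path
anywhere = record.find('.//attribute')        # any descendant of the record

print(direct)                # None
print(repr(wrapper.text))    # whitespace only
print(nested.text)           # "WSTR"
print(nested.get('name'))    # DebugRemoteType
print(anywhere.text)         # "WSTR"
```

Either the explicit path or the descendant search works here; the explicit path is stricter, which can matter if an element named attribute could also appear elsewhere under the record.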
