Creating dictionary from XML file

I have an XML file that looks like this:

<?xml version="1.0" encoding ="utf8"?>
<rebase>
  <Organism>
    <Name>Aminomonas paucivorans</Name>
      <Enzyme>M1.Apa12260I</Enzyme>
        <Motif>GGAGNNNNNGGC</Motif>
      <Enzyme>M2.Apa12260I</Enzyme>
        <Motif>GGAGNNNNNGGC</Motif>
  </Organism>
  <Organism>
    <Name>Bacillus cellulosilyticus</Name>
      <Enzyme>M1.BceNI</Enzyme>
        <Motif>CCCNNNNNCTC</Motif>
      <Enzyme>M2.BceNI</Enzyme>
        <Motif>CCCNNNNNCTC</Motif>
  </Organism>

For each Organism there are multiple Enzymes and Motifs. Enzymes are unique, but motifs can repeat, so I tried to create a dictionary with the enzyme as the key and the motif as the value. This is my code:

    import xml.etree.ElementTree as ET

    def lister():
        tree = ET.parse('rebase.xml')
        rebase = tree.getroot()

        data_dict = {}

        for each_organism in rebase.findall('Organism'):
            try:
                enzyme = each_organism.find('Enzyme').text
            except AttributeError:
                continue

            for motif in each_organism.findall('Motif'):
                motif = motif.text
                data_dict[enzyme] = motif
        return data_dict

However, the dictionary seems to have omitted quite a few entries. I can't seem to understand what the issue is. Any help will be appreciated.

EDIT

A user posted a solution but then deleted it. I managed to copy it in time:

    for each_organism in rebase.findall('Organism'):
        try:
            enzyme = each_organism.find('Enzyme').text
        except AttributeError:
            continue
        data_dict[enzyme] = []
        for motif in each_organism.findall('Motif'):
            data_dict[enzyme].append(motif.text)
    return data_dict

However, the dictionary returned in this case is wrong, and here's why:

An enzyme-motif pair is unique: one enzyme has exactly one motif. Throughout my file an enzyme occurs only once. A motif can occur multiple times, but each time it belongs to a different enzyme, so the pair is still unique. What the code under EDIT does is this:

Assume an enzyme M.ApaI with motif GATC and another one, M.ApaII, with motif TCAG. Both enzymes have very similar names (differing only in the last character, I). The code binds both motifs to the first enzyme, creating {M.ApaI: ['GATC', 'TCAG']}.

The first big problem I see is that you're only searching for the first Enzyme within any given Organism. If you want to find every occurrence of Enzyme, you should use:

 for enzyme in each_organism.findall('Enzyme'):
     # add to dictionary here
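
To make the difference concrete, here is a small sketch run against an inline copy of one Organism from the question (the `SAMPLE` string and the variable names are mine, for illustration only). It only shows what `find()` and `findall()` return; it does not solve the pairing problem by itself.

```python
import xml.etree.ElementTree as ET

# Inline copy of one Organism from the question, used only for illustration.
SAMPLE = """<rebase>
  <Organism>
    <Name>Aminomonas paucivorans</Name>
    <Enzyme>M1.Apa12260I</Enzyme>
    <Motif>GGAGNNNNNGGC</Motif>
    <Enzyme>M2.Apa12260I</Enzyme>
    <Motif>GGAGNNNNNGGC</Motif>
  </Organism>
</rebase>"""

organism = ET.fromstring(SAMPLE).find('Organism')

# find() stops at the first matching child, so only one enzyme is seen
first_enzyme = organism.find('Enzyme').text

# findall() returns every matching direct child, in document order
all_enzymes = [e.text for e in organism.findall('Enzyme')]

print(first_enzyme)
print(all_enzymes)
```

This is why the original loop keeps overwriting a single key per Organism: every Motif in the Organism is assigned to the one enzyme that `find()` returned.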

The second problem is that the format of your XML doesn't match the data relations you seem to be building with your dictionary. Within the XML, Enzyme, Motif, and Name are all children of Organism, but you're assigning motif as a value associated with the enzyme key. When iterating through the Enzyme and Motif elements, you have no reliable way of knowing which one should be associated with which, because they're all siblings jammed together without any logical grouping in the document.

I could be misunderstanding your purpose here, but it seems like you'd be better served by constructing Organism and Enzyme class objects rather than forcing two (apparently) unrelated concepts into a key-value relationship.

This could look like so, and encapsulate your fields:

class Organism:
    # where enzymes is an iterable of Enzyme
    def __init__(self, name, enzymes):
        self.name = name
        self.enzymes = enzymes

and your Enzyme object:

class Enzyme:
    # where motifs is an iterable of string
    def __init__(self, motifs):
        self.motifs = motifs

All this would still require some sort of change in your XML file. Unless you just parse it by line (which is decidedly not the point of XML), I can't think of any easy ways you'd be able to figure out which Motifs belong to which Enzyme right now.

Edit: seeing as you're asking about ways to just iterate fairly blindly through each Enzyme node, and assuming that you always have a single Name element, that you have one Motif for each Enzyme, and that every element after Name alternates Enzyme then Motif (e.g. EMEM and so on), you should be able to do this:

enzymes = []
motifs = []

# enumerate() advances the index on every child, including the skipped Name
# (a manual counter placed after `continue` would never get past 0)
for i, element in enumerate(each_organism):
    # skip the first Name child
    if i == 0:
        continue
    # if we're at an odd index, indicating an enzyme
    if i % 2 == 1:
        enzymes.append(element.text)
    # if we're at an even index, indicating the related motif
    else:
        motifs.append(element.text)

Then, presuming every assumption I laid out holds true, and probably a couple more (I'm not even 100% sure etree always iterates elements top-down), any motif at any given index in motifs will belong to the enzyme at the same index in enzymes. In case I haven't already made it clear: this is incredibly brittle code.
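
A slightly less fragile variant of the same idea is to look at each child's tag instead of its position. Iterating over an Element yields its direct children in document order, so you can remember the most recent Enzyme and attach the next Motif to it. The sketch below is only that, a sketch: `pair_enzymes_with_motifs` and `SAMPLE` are names I made up, and it still assumes each Motif directly follows the Enzyme it belongs to within the same Organism, as in the sample from the question.

```python
import xml.etree.ElementTree as ET

def pair_enzymes_with_motifs(rebase):
    """Map each Enzyme's text to the text of the Motif that follows it.

    Hypothetical helper. Assumes every Motif comes after the Enzyme it
    belongs to, inside the same Organism.
    """
    pairs = {}
    for organism in rebase.findall('Organism'):
        current_enzyme = None
        for child in organism:  # direct children, in document order
            if child.tag == 'Enzyme':
                current_enzyme = child.text
            elif child.tag == 'Motif' and current_enzyme is not None:
                pairs[current_enzyme] = child.text
                current_enzyme = None  # one motif per enzyme
    return pairs

# Inline copy of the question's data, closed with </rebase> so it parses.
SAMPLE = """<rebase>
  <Organism>
    <Name>Aminomonas paucivorans</Name>
    <Enzyme>M1.Apa12260I</Enzyme>
    <Motif>GGAGNNNNNGGC</Motif>
    <Enzyme>M2.Apa12260I</Enzyme>
    <Motif>GGAGNNNNNGGC</Motif>
  </Organism>
  <Organism>
    <Name>Bacillus cellulosilyticus</Name>
    <Enzyme>M1.BceNI</Enzyme>
    <Motif>CCCNNNNNCTC</Motif>
    <Enzyme>M2.BceNI</Enzyme>
    <Motif>CCCNNNNNCTC</Motif>
  </Organism>
</rebase>"""

result = pair_enzymes_with_motifs(ET.fromstring(SAMPLE))
for enzyme, motif in result.items():
    print(enzyme, motif)
```

Unlike the index-parity loop, this does not care whether Name is present or where it sits, and an Enzyme without a Motif is simply left out of the dictionary rather than shifting every later pair. It is still relying on sibling order, though, so nesting each Motif inside its Enzyme in the XML would remain the cleaner fix.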
