How to create a nested list of dictionaries from an XML file in Python

This XML sample represents one metabolite from the HMDB Serum Metabolites dataset:

<?xml version="1.0" encoding="UTF-8"?>
<hmdb xmlns="http://www.hmdb.ca">
<metabolite>
  <version>4.0</version>
  <creation_date>2005-11-16 15:48:42 UTC</creation_date>
  <update_date>2019-01-11 19:13:56 UTC</update_date>
  <accession>HMDB0000001</accession>
  <status>quantified</status>
  <secondary_accessions>
    <accession>HMDB00001</accession>
    <accession>HMDB0004935</accession>
    <accession>HMDB0006703</accession>
    <accession>HMDB0006704</accession>
    <accession>HMDB04935</accession>
    <accession>HMDB06703</accession>
    <accession>HMDB06704</accession>
  </secondary_accessions>
  <name>1-Methylhistidine</name>
  <cs_description>1-Methylhistidine, also known as 1-mhis, belongs to the class of organic compounds known as histidine and derivatives. Histidine and derivatives are compounds containing cysteine or a derivative thereof resulting from reaction of cysteine at the amino group or the carboxy group, or from the replacement of any hydrogen of glycine by a heteroatom. 1-Methylhistidine has been found in human muscle and skeletal muscle tissues, and has also been detected in most biofluids, including cerebrospinal fluid, saliva, blood, and feces. Within the cell, 1-methylhistidine is primarily located in the cytoplasm. 1-Methylhistidine participates in a number of enzymatic reactions. In particular, 1-Methylhistidine and Beta-alanine can be converted into anserine; which is catalyzed by the enzyme carnosine synthase 1. In addition, Beta-Alanine and 1-methylhistidine can be biosynthesized from anserine; which is mediated by the enzyme cytosolic non-specific dipeptidase. In humans, 1-methylhistidine is involved in the histidine metabolism pathway. 1-Methylhistidine is also involved in the metabolic disorder called the histidinemia pathway.</cs_description>
  <description>One-methylhistidine (1-MHis) is derived mainly from the anserine of dietary flesh sources, especially poultry. The enzyme, carnosinase, splits anserine into b-alanine and 1-MHis. High levels of 1-MHis tend to inhibit the enzyme carnosinase and increase anserine levels. Conversely, genetic variants with deficient carnosinase activity in plasma show increased 1-MHis excretions when they consume a high meat diet. Reduced serum carnosinase activity is also found in patients with Parkinson's disease and multiple sclerosis and patients following a cerebrovascular accident. Vitamin E deficiency can lead to 1-methylhistidinuria from increased oxidative effects in skeletal muscle. 1-Methylhistidine is a biomarker for the consumption of meat, especially red meat.</description>
  <synonyms>
    <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid</synonym>
    <synonym>1-Methylhistidine</synonym>
    <synonym>Pi-methylhistidine</synonym>
    <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate</synonym>
    <synonym>1 Methylhistidine</synonym>
    <synonym>1-Methyl histidine</synonym>
  </synonyms>
  <chemical_formula>C7H11N3O2</chemical_formula>
  <smiles>CN1C=NC(C[C@H](N)C(O)=O)=C1</smiles>
  <inchikey>BRMWTNUJHUMWMS-LURJTMIESA-N</inchikey>
  <diseases>
    <disease>
      <name>Kidney disease</name>
      <omim_id/>
      <references>
        <reference>
          <reference_text>McGregor DO, Dellow WJ, Lever M, George PM, Robson RA, Chambers ST: Dimethylglycine accumulates in uremia and predicts elevated plasma homocysteine concentrations. Kidney Int. 2001 Jun;59(6):2267-72.</reference_text>
          <pubmed_id>11380830</pubmed_id>
        </reference>
        <reference>
          <reference_text>Ehrenpreis ED, Salvino M, Craig RM: Improving the serum D-xylose test for the identification of patients with small intestinal malabsorption. J Clin Gastroenterol. 2001 Jul;33(1):36-40.</reference_text>
          <pubmed_id>11418788</pubmed_id>
        </reference>
      </references>
    </disease>
  </diseases>
</metabolite>
</hmdb>

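What I want to do is run a nested loop and create a list of dictionaries.

Each dictionary will represent one metabolite.

Each key in the dictionary will be a selected node (by tag name).

The value of each key will be either a list of strings or a single string.

This is the structure I think I need (better ideas are also welcome):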

[  
    {
    "accession":"accession.value", 
    "name": "name.value",
    "synonyms":[synonyms.value.1, synonyms.value.2, synonyms.value.3,... ], 
    "chemical_formula":"chemical_formula.value", 
    "smiles": "smiles.value",
    "inchikey":"inchikey.value", 
    "biological_properties_pathways":[pathways.value1, pathways.value2, pathways.value3,.. ]
    "diseases":[disease.name.1, disease.name.2, disease.name.3,.. ]
    "pubmed_id's for disease.name.1":[pubmed_id.value.1, pubmed_id.value.2, pubmed_id.value.3,... ]
    "pubmed_id's for disease.name.2":[pubmed_id.value.1, pubmed_id.value.2, pubmed_id.value.3,... ]
    .
    .
    .
    }, 
    {"accession":"accession.value", 
    "name": "name.value",
    "synonyms":[synonyms.value.1, synonyms.value.2, synonyms.value.3,... ], 
    "chemical_formula":"chemical_formula.value", 
    "smiles": "smiles.value",
    "inchikey":"inchikey.value", 
    "biological_properties_pathways":[pathways.value1, pathways.value2, pathways.value3,.. ]
    "diseases":[disease.name.1, disease.name.2, disease.name.3,.. ]
    "pubmed_id's for disease.name.1":[pubmed_id.value.1, pubmed_id.value.2, pubmed_id.value.3,... ]
    "pubmed_id's for disease.name.2":[pubmed_id.value.1, pubmed_id.value.2, pubmed_id.value.3,... ]
    .
    .
    .
    },
    .
    .
    .
] 
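
One possible alternative to the flat layout above, offered only as a sketch: a key such as "pubmed_id's for disease.name.1" puts data (the disease name) into the key itself, which makes later lookups awkward. Nesting each disease as its own dictionary keeps the name and its PubMed IDs together. The key name `pubmed_ids` and the variable names `example_metabolite` and `metabolite_records` are assumptions chosen for illustration; the values are copied from the XML sample above.

```python
# Hypothetical target structure for one metabolite; values come from the sample XML.
example_metabolite = {
    "accession": "HMDB0000001",
    "name": "1-Methylhistidine",
    "synonyms": ["1-Methylhistidine", "Pi-methylhistidine"],  # shortened list
    "chemical_formula": "C7H11N3O2",
    "smiles": "CN1C=NC(C[C@H](N)C(O)=O)=C1",
    "inchikey": "BRMWTNUJHUMWMS-LURJTMIESA-N",
    "diseases": [
        {"name": "Kidney disease", "pubmed_ids": ["11380830", "11418788"]},
    ],
}

# The full result is then simply a list of such dictionaries.
metabolite_records = [example_metabolite]

# Looking up the PubMed IDs of every disease needs no string-built keys.
for disease in example_metabolite["diseases"]:
    print(disease["name"], disease["pubmed_ids"])
```

With this shape, the list of disease names is still easy to recover with `[d["name"] for d in record["diseases"]]`.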

This is what I have done so far:

# Import packages
import xml.etree.ElementTree as et

# Load the data and get the root element (<hmdb>)
data1 = et.parse('D:/path/to/my/Projects/HMDB/DataSets/saliva_metabolites/saliva_metabolites.xml')
root = data1.getroot()

# Map a short prefix to the HMDB namespace
ns = {"h": "http://www.hmdb.ca"}

# Extract only the first 3 metabolites to keep things simple
metabolites = root.findall('./h:metabolite', ns)[0:3]
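
As a side note on the namespace step, here is a minimal self-contained sketch. The inline `sample_xml` string and the variable names are made up for illustration and are not part of the HMDB file. Because the root element declares a default namespace, ElementTree stores every tag as `{http://www.hmdb.ca}tag`; the `ns` mapping lets path expressions use the shorter `h:` prefix instead.

```python
import xml.etree.ElementTree as et

# Tiny stand-in for the real file (assumption: same default namespace as HMDB).
sample_xml = """<hmdb xmlns="http://www.hmdb.ca">
  <metabolite><accession>HMDB0000001</accession></metabolite>
  <metabolite><accession>HMDB0000005</accession></metabolite>
</hmdb>"""

root = et.fromstring(sample_xml)
ns = {"h": "http://www.hmdb.ca"}

# Without the prefix nothing matches, because the real tag includes the namespace.
print(root.findall("metabolite"))   # []
print(root[0].tag)                  # {http://www.hmdb.ca}metabolite

# With the prefix mapping the same elements are found.
found = root.findall("h:metabolite", ns)
accessions = [m.findtext("h:accession", namespaces=ns) for m in found]
print(accessions)                   # ['HMDB0000001', 'HMDB0000005']
```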

Now I run a nested loop over the 3 metabolites and select specific nodes (the first 2 that I need) into a dictionary.

newlist = []
for child in metabolites:
    innerlist = []
    dicts = {}
    for subchild in child:
        if subchild.tag=='{http://www.hmdb.ca}accession':
            dicts={"accession":  subchild.text}
        if subchild.tag == '{http://www.hmdb.ca}name':
            dicts = {"name": subchild.text}
            innerlist.append(subchild.text)
            print(innerlist)
    newlist.append(dicts)

I got this output:

>> print(newlist)
[{'name': '1-Methylhistidine'}, {'name': '2-Ketobutyric acid'}, {'name': '2-Hydroxybutyric acid'}]

instead of

[{'accession': 'HMDB0000001','name': '1-Methylhistidine' },
 {'accession': 'HMDB0000005', 'name': '2-Ketobutyric acid'},
 {'accession': 'HMDB0000008', 'name': '2-Hydroxybutyric acid'}]

This means that <name> overwrote <accession>.

I also tried to store a list as the value of a key:

newlist = []
for child in metabolites:
    innerlist = []
    dicts = {}
    for subchild in child:
        # if subchild.tag=='{http://www.hmdb.ca}accession':
        #     dicts={"accession":  subchild.text}
        # if subchild.tag == '{http://www.hmdb.ca}name':
        #     dicts = {"name": subchild.text}
        if subchild.tag == '{http://www.hmdb.ca}synonyms':
            for synonym in subchild:
                dicts = {"synonyms": synonym.text}
                print(synonym.text)
            innerlist.append(subchild.text)
            print(innerlist)

    newlist.append(dicts)

The output was overwritten again:

>> print(newlist)
[{'synonyms': '1-Methylhistidine dihydrochloride'},
 {'synonyms': 'alpha-Ketobutyric acid, sodium salt'},
 {'synonyms': '2-Hydroxybutyric acid, monosodium salt, (+-)-isomer'}]

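Each of the 3 keys above holds only the last value from its list, rather than the whole list of values.

I expected something like this (with all the values for each "synonyms" key):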

>> print(newlist)
[{'synonyms': ['(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid',
               '1-Methylhistidine',
               ....
               '1-Methylhistidine dihydrochloride' ]},

 {'synonyms': ['2-Ketobutanoic acid',
               '2-Oxobutyric acid',
                ....
               'alpha-Ketobutyric acid, sodium salt']},

 {'synonyms': [ '2-Hydroxybutanoic acid',
                'alpha-Hydroxybutanoic acid',
                ....
                '2-Hydroxybutyric acid, monosodium salt, (+-)-isomer']}
]

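I used these questions to write the loops:

  1. Create a list of dictionaries in Python - I think this is very similar, but I could not make it work
  2. How to create and fill a list of lists in a for loop
  3. Python ElementTree - iterate through child nodes and text in order
  4. Filling a dictionary using a for loop (Python) [duplicate]
  5. Generating nested lists from an XML document

Any ideas, hints, or clues would be much appreciated.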

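The problem with the first code snippet is probably that a new dictionary is reassigned to the variable dicts each time, which discards the keys set earlier: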

newlist = []
for child in metabolites:
    innerlist = []
    dicts = {}
    for subchild in child:
        if subchild.tag=='{http://www.hmdb.ca}accession':
            dicts={"accession":  subchild.text}
        if subchild.tag == '{http://www.hmdb.ca}name':
            # here the old value of dicts is overwritten with a new dictionary
            dicts = {"name": subchild.text}
            innerlist.append(subchild.text)
            print(innerlist)
    newlist.append(dicts)

You should probably use an assignment of the form dict[key] = value:

newlist = []
for child in metabolites:
    innerlist = []
    dicts = {}
    for subchild in child:
        if subchild.tag=='{http://www.hmdb.ca}accession':
            dicts["accession"] =  subchild.text
        if subchild.tag == '{http://www.hmdb.ca}name':
            dicts["name"] =  subchild.text
            innerlist.append(subchild.text)
            print(innerlist)
    newlist.append(dicts)
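
If more single-valued fields are needed, the chain of `if` statements can be replaced by a lookup against a set of wanted tag names. The following is a sketch under assumptions: `WANTED`, `local_name`, `records`, and the inline sample are hypothetical names introduced here, and the sample contains only a few of the fields of a real record.

```python
import xml.etree.ElementTree as et

sample_xml = """<hmdb xmlns="http://www.hmdb.ca">
  <metabolite>
    <accession>HMDB0000001</accession>
    <name>1-Methylhistidine</name>
    <chemical_formula>C7H11N3O2</chemical_formula>
    <status>quantified</status>
  </metabolite>
</hmdb>"""

# Tags whose text should be copied into the dictionary (illustrative choice).
WANTED = {"accession", "name", "chemical_formula"}


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of a tag."""
    return tag.split("}", 1)[-1]


root = et.fromstring(sample_xml)
records = []
for metabolite in root.findall("{http://www.hmdb.ca}metabolite"):
    record = {}
    for child in metabolite:
        key = local_name(child.tag)
        if key in WANTED:
            record[key] = child.text  # add a key; never rebind `record` itself
    records.append(record)

print(records)
```

Iterating over direct children only matters here: `<secondary_accessions>` also contains `<accession>` elements, and a recursive search would pick those up as well.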

The second code snippet seems to have a similar problem. A version that collects the synonyms into a list instead:

newlist = []
for child in metabolites:
    dicts = {}
    innerlist = []
    for subchild in child:
        if subchild.tag == '{http://www.hmdb.ca}synonyms':
            for synonym in subchild:
                innerlist.append(synonym.text)
    dicts["synonyms"] = innerlist

    newlist.append(dicts)

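However (as has already been pointed out), you could use a more convenient library instead of parsing the XML manually.

Here is the combined script: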

newlist = []
for child in metabolites:
    dicts = {}
    innerlist = []
    for subchild in child:
        if subchild.tag=='{http://www.hmdb.ca}accession':
            dicts["accession"] =  subchild.text
        if subchild.tag == '{http://www.hmdb.ca}name':
            dicts["name"] =  subchild.text
        if subchild.tag == '{http://www.hmdb.ca}synonyms':
            for synonym in subchild:
                innerlist.append(synonym.text)
            dicts["synonyms"] = innerlist
    newlist.append(dicts)
   
print(newlist)

It outputs the following result:

[{'accession': 'HMDB0000001', 'name': '1-Methylhistidine', 'synonyms': ['(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid', '1-Methylhistidine', 'Pi-methylhistidine', '(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate', '1 Methylhistidine', '1-Methyl histidine']}]
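
To extend the combined script towards the nested fields from the question (diseases and their PubMed IDs), one option is to use ElementTree's path-based `findtext` and `findall` rather than tag comparisons. This is a sketch, not a tested solution for the full HMDB file: the function name `metabolite_to_dict`, the key `pubmed_ids`, and the trimmed inline sample are assumptions for illustration.

```python
import xml.etree.ElementTree as et

NS = {"h": "http://www.hmdb.ca"}

# Trimmed copy of the sample record from the question.
sample_xml = """<hmdb xmlns="http://www.hmdb.ca">
<metabolite>
  <accession>HMDB0000001</accession>
  <secondary_accessions><accession>HMDB00001</accession></secondary_accessions>
  <name>1-Methylhistidine</name>
  <synonyms>
    <synonym>1-Methylhistidine</synonym>
    <synonym>Pi-methylhistidine</synonym>
  </synonyms>
  <diseases>
    <disease>
      <name>Kidney disease</name>
      <references>
        <reference><pubmed_id>11380830</pubmed_id></reference>
        <reference><pubmed_id>11418788</pubmed_id></reference>
      </references>
    </disease>
  </diseases>
</metabolite>
</hmdb>"""


def metabolite_to_dict(metabolite):
    """Build one dictionary from a <metabolite> element."""
    record = {
        # 'h:accession' matches direct children only, so the entries under
        # <secondary_accessions> are not picked up.
        "accession": metabolite.findtext("h:accession", namespaces=NS),
        "name": metabolite.findtext("h:name", namespaces=NS),
        "synonyms": [s.text for s in metabolite.findall("h:synonyms/h:synonym", NS)],
        "diseases": [],
    }
    for disease in metabolite.findall("h:diseases/h:disease", NS):
        record["diseases"].append({
            "name": disease.findtext("h:name", namespaces=NS),
            "pubmed_ids": [p.text for p in disease.findall(
                "h:references/h:reference/h:pubmed_id", NS)],
        })
    return record


root = et.fromstring(sample_xml)
records = [metabolite_to_dict(m) for m in root.findall("h:metabolite", NS)]
print(records)
```

Each disease becomes its own small dictionary here, so the PubMed IDs stay attached to the disease they belong to. A missing element yields `None` from `findtext` and an empty list from `findall`, so records that lack a field do not raise an error.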
