Extracting data from XML into dictionary (1 line as key, next line as item)

I have an XML file with a header of summary information followed by a main data table. I've already got the main data table out and into a pandas DataFrame, but now I want to extract parts of the header information into a dictionary.

Example of XML file:

<Workbook>
 <Worksheet>
  <Tables>
   <Row>
    <Cell ss:StyleID="HeadTableTitle" ss:MergeAcross="1"><Data ss:Type="String">Administrative Data</Data></Cell>
   </Row>
   <Row>
    <Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">ID</Data></Cell>
    <Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">B013</Data></Cell>
   </Row>
   <Row>
    <Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">Title</Data></Cell>
    <Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">Mr</Data></Cell>
   </Row>
   <Row>
    <Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">Last Name</Data></Cell>
    <Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">Data</Data></Cell>
   </Row>
   <Row>
    <Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">First Name</Data></Cell>
    <Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">Test</Data></Cell>
   </Row>
   <Row/>
   <Row/>
   <Row>
    <Cell ss:StyleID="HeadTableTitle" ss:MergeAcross="1"><Data ss:Type="String">Biological and Medical Baseline Data</Data></Cell>
   </Row>
   <Row>
    <Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">Height</Data></Cell>
    <Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">176 cm</Data></Cell>
   </Row>
   <Row>
    <Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">Weight</Data></Cell>
    <Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">56.9 kg</Data></Cell>
   </Row>
  </Tables>
 </Worksheet>
</Workbook>

What I want to do is extract the data in the 'Administrative Data' section into a dictionary, with the first cell's value as the key (ideally with the spaces removed) and the second cell's value as the item. This then needs to be repeated so that the data in the 'Biological and Medical Baseline Data' section is held in a separate dictionary. The dictionary names can be anything (e.g. 'subject' and 'biomed').

Current code to parse the XML file and access the tags:
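To make the target concrete, here is a minimal sketch of the intended result for the 'Administrative Data' rows in the sample above. The name/value pairs are typed in by hand from the sample rather than parsed, and `normalize_key` is a hypothetical helper name used only for illustration.

```python
def normalize_key(label):
    """Remove all whitespace from a header label so it can serve as a compact key."""
    return ''.join(label.split())


# Name/value pairs copied by hand from the 'Administrative Data' rows of the sample.
header_rows = [
    ('ID', 'B013'),
    ('Title', 'Mr'),
    ('Last Name', 'Data'),
    ('First Name', 'Test'),
]

# First cell (spaces removed) becomes the key, second cell becomes the item.
subject = {normalize_key(name): value for name, value in header_rows}
print(subject)
```

The 'Biological and Medical Baseline Data' rows would go into a second dictionary built the same way.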

from lxml import etree

f_path = 'data store/cortex_full.xml'  # enter path of xml file

# open and parse xml file
with open(f_path, 'r', encoding='utf-8') as f:  # set encoding to utf-8 for mac
    root = etree.parse(f)

namespaces = {'o': 'urn:schemas-microsoft-com:office:office',
              'x': 'urn:schemas-microsoft-com:office:excel',
              'ss': 'urn:schemas-microsoft-com:office:spreadsheet'}


ws = root.xpath('/ss:Workbook/ss:Worksheet', namespaces=namespaces)
if len(ws) > 0:
    tables = ws[0].xpath('./ss:Table', namespaces=namespaces)
    if len(tables) > 0:
        rows = tables[0].xpath('./ss:Row', namespaces=namespaces)
        for row in rows:
            cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)

Any suggestions on how to approach this? If a dictionary is not an option, I'm also happy to work with other suggestions.

Make sure you declare the following before the loop:

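subject = {}
bio = {}
d = None  # refers to the dictionary of the current section; set when a title row is seen.
          # If name/value rows can appear before any title row, start with d = {} instead.

And consider replacing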

for row in rows:
   cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)

with

        for row in rows:
            cells = row.xpath('./ss:Cell', namespaces=namespaces)
            if len(cells) == 2:
                # name/value row: decide which cell is the key by its StyleID
                key = None
                item = None
                for cell in cells:
                    text = cell.xpath('./ss:Data', namespaces=namespaces)[0].text.strip()
                    if cell.get('{urn:schemas-microsoft-com:office:spreadsheet}StyleID') == 'HeadTableParameterName':
                        key = text
                    else:
                        item = text
                if key is not None and item is not None:
                    d[key] = item
            elif len(cells) == 1:
                # title row: switch the dictionary that the following rows are written to
                if cells[0].get('{urn:schemas-microsoft-com:office:spreadsheet}StyleID') == 'HeadTableTitle':
                    if cells[0].xpath('./ss:Data', namespaces=namespaces)[0].text == 'Biological and Medical Baseline Data':
                        d = bio
                    else:
                        d = subject
print(bio)
print(subject)

Although not strictly necessary, I've put in some checks just to give an idea; you can extend them to make the code more robust.

I also have a working version here.
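As one possible way to extend those checks, the sketch below collects every titled section into its own dictionary and removes the spaces from the keys. It is a self-contained illustration under a few assumptions, not the code from the linked version: it uses the standard library's `xml.etree.ElementTree` in place of lxml so it runs without extra packages, `header_sections` is a hypothetical function name, and `SAMPLE` is a trimmed copy of the question's data using `<Table>` (as the XPath in the question expects) with namespace declarations added.

```python
import xml.etree.ElementTree as ET

SS = 'urn:schemas-microsoft-com:office:spreadsheet'
NS = {'ss': SS}
STYLE_ID = '{%s}StyleID' % SS  # fully qualified name of the ss:StyleID attribute

# Trimmed sample; the xmlns declarations are an assumption about the real export.
SAMPLE = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet>
  <Table>
   <Row>
    <Cell ss:StyleID="HeadTableTitle"><Data ss:Type="String">Administrative Data</Data></Cell>
   </Row>
   <Row>
    <Cell ss:StyleID="HeadTableParameterName"><Data ss:Type="String">Last Name</Data></Cell>
    <Cell ss:StyleID="HeadTableParameterValue"><Data ss:Type="String">Data</Data></Cell>
   </Row>
   <Row/>
   <Row>
    <Cell ss:StyleID="HeadTableTitle"><Data ss:Type="String">Biological and Medical Baseline Data</Data></Cell>
   </Row>
   <Row>
    <Cell ss:StyleID="HeadTableParameterName"><Data ss:Type="String">Height</Data></Cell>
    <Cell ss:StyleID="HeadTableParameterValue"><Data ss:Type="String">176 cm</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>"""


def header_sections(root):
    """Return {section title: {key without spaces: value}} for the header rows."""
    sections = {}
    current = None  # dictionary of the section being filled; None until a title row is seen
    for row in root.iterfind('.//ss:Table/ss:Row', NS):
        cells = row.findall('ss:Cell', NS)
        styles = [cell.get(STYLE_ID) for cell in cells]
        texts = [cell.findtext('ss:Data', default='', namespaces=NS).strip() for cell in cells]
        if styles == ['HeadTableTitle']:
            # Title row: start (or reuse) the dictionary for this section.
            current = sections.setdefault(texts[0], {})
        elif styles == ['HeadTableParameterName', 'HeadTableParameterValue'] and current is not None:
            # Name/value row: store it under the current section, spaces removed from the key.
            current[texts[0].replace(' ', '')] = texts[1]
        # Empty rows and main data table rows match neither pattern and are skipped.
    return sections


sections = header_sections(ET.fromstring(SAMPLE))
subject = sections.get('Administrative Data', {})
biomed = sections.get('Biological and Medical Baseline Data', {})
print(subject)
print(biomed)
```

Matching on the exact list of StyleID values, rather than on the number of cells alone, keeps two-cell rows from the main data table out of the result, and keying the outer dictionary by title avoids hard-coding the section names.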
