Problem parsing XML and importing the data into a Pandas DataFrame

Question

I am trying to import data from an XML file that contains breath-by-breath data from an exercise test. The XML structure is as follows (simplified to show the general structure):

<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
   xmlns:o="urn:schemas-microsoft-com:office:office"
   xmlns:x="urn:schemas-microsoft-com:office:excel"
   xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
   xmlns:html="http://www.w3.org/TR/REC-html40">
    <Worksheet ss:Name="MetasoftStudio">
        <Table ss:ExpandedColumnCount="21" ss:ExpandedRowCount="458" x:FullColumns="1"    x:FullRows="1" ss:StyleID="s62" ss:DefaultColumnWidth="53">
            <Column ss:StyleID="s62" ss:AutoFitWidth="0" ss:Width="137"/>
            <Column ss:StyleID="s62" ss:AutoFitWidth="0" ss:Width="97"/>
            <Column ss:StyleID="s62" ss:AutoFitWidth="0" ss:Width="137"/>
            <Row ss:AutoFitHeight="0" ss:Height="26">
                <Cell ss:StyleID="Default"><Data ss:Type="String">t</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">Phase</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">Marker</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">V'O2</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">V'O2/kg</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">V'O2/HR</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">HR</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">WR</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">V'E/V'O2</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">V'E/V'CO2</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">RER</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">V'E</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">BF</Data></Cell>
            </Row>
            <Row ss:Height="15">
                <Cell ss:StyleID="Default"><Data ss:Type="String">h:mm:ss</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">L/min</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">ml/min/kg</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">ml</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">/min</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">W</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">L/min</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">/min</Data></Cell>
            </Row>
            <Row ss:Height="15">
                <Cell ss:StyleID="Default"><Data ss:Type="String">0:00:06</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String">Rest</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="Number">0.27972413565454501</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="Number">4.3706896196022598</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="Number">4.5856415681072953</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="Number">61</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="Number">0</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="Number">27.002532271037801</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="Number">26.4113108545688</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="Number">1.0223851598932201</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="Number">10.155340000000001</Data></Cell>
                <Cell ss:StyleID="Default"><Data ss:Type="Number">18.07</Data></Cell>
            </Row>
        </Table>
    </Worksheet>
</Workbook>

I have used lxml to parse and iterate over the XML file, then extracted the 'Data' in each 'Cell', appending it to a list, and then appending that list to a parent list (giving me a nested list of each row), using the code:

from lxml import etree, objectify
import pandas as pd

with open('Python/cortex.xml') as infile:
    xml_file = infile.read()

    root = objectify.fromstring(xml_file)

    header = []
    data = []

    for row in root.Worksheet.Table.getchildren():
        temp_row = []
        if not row.tag == '{urn:schemas-microsoft-com:office:spreadsheet}Column':
            for cell in row.getchildren():
                temp_row.append(cell.Data)
            data.append(temp_row)
    header = data.pop(0) #remove the first 'row' and store in header list
    del data[0] #remove 2nd line of superfluous data

The first row gives the headers, hence I pop that into its own list, and row 2 contains the units for each variable, so I just get rid of that. All working well so far (or so it seemed)...

Now I need to get it into a pandas DataFrame to start working with it. If I run df = pd.DataFrame(data, columns=header) and then print(df), I get: ValueError: Buffer has wrong number of dimensions (expected 1, got 32)

OK, not sure what happened there... If I create the df without assigning the header and print it, I get:

              0           1       2                        3   \
0  [[[0:00:06]]]  [[[Rest]]]  [[[]]]  [[[0.279724135654545]]]   
1  [[[0:00:09]]]  [[[Rest]]]  [[[]]]  [[[0.465136232899829]]]   
2  [[[0:00:13]]]  [[[Rest]]]  [[[]]]  [[[0.357975433456662]]]   
3  [[[0:00:19]]]  [[[Rest]]]  [[[]]]  [[[0.543332419057909]]]   
4  [[[0:00:24]]]  [[[Rest]]]  [[[]]]  [[[0.374604578743889]]]

That doesn't look right! Where did all these lists in lists in lists come from? If I iterate over the nested list data and print it, it prints perfectly, but once I try to convert it to a df, something goes wrong.

Can anyone enlighten me as to what has happened and how I can get the data into the pd df? If there is a better method than how I've done it, I am happy to give it a go.

Answer 1

You can create a list of lists and then build the DataFrame with the constructor. For the parsing, this solution is used:

from lxml import etree
import pandas as pd

# Open in binary mode so lxml can detect the encoding itself
with open('test.xml', 'rb') as f:
    doc = etree.parse(f)

namespaces={'o':'urn:schemas-microsoft-com:office:office',
            'x':'urn:schemas-microsoft-com:office:excel',
            'ss':'urn:schemas-microsoft-com:office:spreadsheet'}

L = []
ws = doc.xpath('/ss:Workbook/ss:Worksheet', namespaces=namespaces)
if len(ws) > 0: 
    tables = ws[0].xpath('./ss:Table', namespaces=namespaces)
    if len(tables) > 0: 
        rows = tables[0].xpath('./ss:Row', namespaces=namespaces)
        for row in rows:
            tmp = []
            cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)
            for cell in cells:
                # print(cell.text)
                tmp.append(cell.text)
            L.append(tmp)
print (L)

[['t', 'Phase', 'Marker', "V'O2", "V'O2/kg", "V'O2/HR", 'HR', 'WR', 
  "V'E/V'O2", "V'E/V'CO2", 'RER', "V'E", 'BF'], 
 ['h:mm:ss', None, None, 'L/min', 'ml/min/kg', 'ml', 
 '/min', 'W', None, None, None, 'L/min', '/min'], 
 ['0:00:06', 'Rest', None, '0.27972413565454501', '4.3706896196022598',
  '4.5856415681072953', '61', '0', '27.002532271037801', '26.4113108545688', 
  '1.0223851598932201', '10.155340000000001', '18.07']]

df = pd.DataFrame(L[2:], columns=L[0])
print (df)
         t Phase Marker                 V'O2             V'O2/kg  \
0  0:00:06  Rest   None  0.27972413565454501  4.3706896196022598   

              V'O2/HR  HR WR            V'E/V'O2         V'E/V'CO2  \
0  4.5856415681072953  61  0  27.002532271037801  26.4113108545688   

                  RER                 V'E     BF  
0  1.0223851598932201  10.155340000000001  18.07
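
As for where the lists in lists came from in the question: as far as I can tell, the elements returned by lxml.objectify (such as cell.Data) behave like sequences (they support len(), indexing and iteration over same-named siblings), so pandas tries to unpack each one as a nested list instead of treating it as a single value. A minimal sketch of a fix that keeps objectify but stores the plain .text of each Data element is shown below. The three-column XML sample is a shortened, hypothetical stand-in for the real file, and the names XML, NS and rows are only for illustration. Because every value arrives as a string (or None for empty cells), the numeric column is converted afterwards with pd.to_numeric.

```python
from lxml import objectify
import pandas as pd

# Hypothetical, shortened sample in the same SpreadsheetML layout as the question
XML = b"""<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="MetasoftStudio">
    <Table>
      <Column ss:Width="137"/>
      <Row>
        <Cell><Data ss:Type="String">t</Data></Cell>
        <Cell><Data ss:Type="String">Phase</Data></Cell>
        <Cell><Data ss:Type="String">HR</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="String">h:mm:ss</Data></Cell>
        <Cell><Data ss:Type="String"></Data></Cell>
        <Cell><Data ss:Type="String">/min</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="String">0:00:06</Data></Cell>
        <Cell><Data ss:Type="String">Rest</Data></Cell>
        <Cell><Data ss:Type="Number">61</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>"""

NS = '{urn:schemas-microsoft-com:office:spreadsheet}'
root = objectify.fromstring(XML)

rows = []
# Iterate over Row elements only, which skips the Column elements
for row in root.Worksheet.Table.iterchildren(NS + 'Row'):
    # Store the plain string (or None for an empty cell), not the element itself
    rows.append([cell.Data.text for cell in row.iterchildren(NS + 'Cell')])

# Row 0 is the header, row 1 holds the units, the rest is data
header, data = rows[0], rows[2:]
df = pd.DataFrame(data, columns=header)

# All values are strings at this point; convert the numeric column explicitly
df['HR'] = pd.to_numeric(df['HR'])
print(df)
```

The same pd.to_numeric step should apply to the DataFrame built from L above, since cell.text also yields strings there; columns such as t and Phase would stay as text.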

Answer 1
1 accepted 2017-09-24 07:18:40
