Extract specific cell data from an XML .xls file with Python
I have a giant XML file that is exported from a device as a .xls file:
<?xml version='1.0'?>
<?mso-application progid='Excel.Sheet'?>
<s:Workbook xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:s="urn:schemas-microsoft-com:office:spreadsheet">
<s:Styles>
...
<s:Worksheet s:Name="Description">
...
<s:Worksheet s:Name="Data">
<s:Table s:DefaultColumnWidth="100">
<s:Row>
<s:Cell s:StyleID="Bold">
<s:Data s:Type="String">Time</s:Data>
</s:Cell>
<s:Cell s:StyleID="Bold">
<s:Data s:Type="String">Temp1</s:Data>
</s:Cell>
<s:Cell s:StyleID="Bold">
<s:Data s:Type="String">Temp2</s:Data>
</s:Cell>
<s:Cell s:StyleID="Bold">
<s:Data s:Type="String">Liquid</s:Data>
</s:Cell>
<s:Cell s:StyleID="Bold">
<s:Data s:Type="String">Response</s:Data>
</s:Cell>
<s:Cell s:StyleID="Bold">
<s:Data s:Type="String">Base</s:Data>
</s:Cell>
<s:Cell s:StyleID="Bold">
<s:Data s:Type="String">Events</s:Data>
</s:Cell>
<s:Cell s:StyleID="Bold">
<s:Data s:Type="String">Low</s:Data>
</s:Cell>
<s:Cell s:StyleID="Bold">
<s:Data s:Type="String">High</s:Data>
</s:Cell>
<s:Cell />
</s:Row>
...
<s:Row>
<s:Cell s:StyleID="Default">
<s:Data s:Type="Number">45</s:Data> # Time
</s:Cell>
# There is no Temp1 data
<s:Cell />
<s:Cell s:StyleID="Default">
<s:Data s:Type="Number">29.74</s:Data> # Temp2
</s:Cell>
<s:Cell s:StyleID="Default">
<s:Data s:Type="Number">12.11</s:Data> # Liquid
</s:Cell>
<s:Cell s:StyleID="Default">
<s:Data s:Type="Number">100</s:Data> # Response
</s:Cell>
<s:Cell s:StyleID="Default">
<s:Data s:Type="Number">30</s:Data> # Base
</s:Cell>
# There are no events in this data
<s:Cell />
<s:Cell s:StyleID="Default">
<s:Data s:Type="Number">0</s:Data> # Low
</s:Cell>
<s:Cell s:StyleID="Default">
<s:Data s:Type="Number">55</s:Data> # High
</s:Cell>
<s:Cell />
</s:Row>
What I am trying to do is extract information from the worksheet named "Data". There are 9 headers for the data, but I am only interested in the data that corresponds to "Time" and "Temp2", which would be "45" and "29.74", respectively.
I have managed to figure out how to navigate the file using:
import xml.etree.ElementTree as ET

tree = ET.parse('xmlfile')
root = tree.getroot()
ns = {'x': 'urn:schemas-microsoft-com:office:excel',
      'o': 'urn:schemas-microsoft-com:office:office',
      's': 'urn:schemas-microsoft-com:office:spreadsheet'}
root.findall('./s:Worksheet/s:Table/s:Row/s:Cell/s:Data', namespaces=ns)
The closest I have gotten to getting the data out of the cells is using an example I found in another post, and trying variations of the following:
for elem in tree.iter():
    if elem.text is not None:
        print(elem.text)
This outputs everything (all 18901 rows of data), and I do not really know how to proceed from here. Ultimately, I would like to store this data in a data frame or something equivalent so that I can plot it.
This may be a naive suggestion, but have you tried simply using Pandas (after installing the package, of course)?
import pandas

# excel_file is a placeholder for the path to the exported .xls file
df = pandas.read_excel(excel_file)
# ... analyze and plot from the DataFrame
(This could have been a comment, but I'm not allowed to comment yet...)
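If staying with ElementTree, as the question started out doing, the extraction might look roughly like the sketch below. It is only a sketch under several assumptions: the first row of the "Data" worksheet is the header row, empty columns always appear as an empty `<s:Cell />` (as in the sample) so that cells line up with headers by position, and the helper names `read_sheet` and `column_pairs` are made up for illustration. A trimmed copy of the sample XML is embedded so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

SS = 'urn:schemas-microsoft-com:office:spreadsheet'
ns = {'s': SS}

def read_sheet(root, sheet_name):
    """Hypothetical helper: rows of one worksheet as dicts keyed by header."""
    for ws in root.findall('s:Worksheet', ns):
        # Attributes carry the namespace too, hence the {uri}Name form.
        if ws.get('{%s}Name' % SS) == sheet_name:
            break
    else:
        raise ValueError('no worksheet named %r' % sheet_name)

    rows = []
    for row in ws.findall('s:Table/s:Row', ns):
        values = []
        for cell in row.findall('s:Cell', ns):
            data = cell.find('s:Data', ns)
            # An empty <s:Cell /> has no Data child; keep its slot as None.
            values.append(data.text if data is not None else None)
        rows.append(values)

    # Assumption: the first row holds the column headers.
    header, body = rows[0], rows[1:]
    return [{h: v for h, v in zip(header, values) if h is not None}
            for values in body]

def column_pairs(records, x='Time', y='Temp2'):
    """Hypothetical helper: (x, y) floats for rows where both are present."""
    return [(float(r[x]), float(r[y])) for r in records
            if r.get(x) is not None and r.get(y) is not None]

# Trimmed version of the sample from the question, for demonstration only.
sample = """<?xml version='1.0'?>
<s:Workbook xmlns:s="urn:schemas-microsoft-com:office:spreadsheet">
<s:Worksheet s:Name="Description"><s:Table /></s:Worksheet>
<s:Worksheet s:Name="Data"><s:Table>
<s:Row>
<s:Cell><s:Data s:Type="String">Time</s:Data></s:Cell>
<s:Cell><s:Data s:Type="String">Temp1</s:Data></s:Cell>
<s:Cell><s:Data s:Type="String">Temp2</s:Data></s:Cell>
<s:Cell />
</s:Row>
<s:Row>
<s:Cell><s:Data s:Type="Number">45</s:Data></s:Cell>
<s:Cell />
<s:Cell><s:Data s:Type="Number">29.74</s:Data></s:Cell>
<s:Cell />
</s:Row>
</s:Table></s:Worksheet>
</s:Workbook>"""

root = ET.fromstring(sample)
records = read_sheet(root, 'Data')
pairs = column_pairs(records)
print(pairs)
```

For the real export, `root` would come from `ET.parse(...).getroot()` as in the question. The list of dicts returned by `read_sheet` can be passed to `pandas.DataFrame(records)` if a data frame is wanted for plotting; the values are still strings at that point, so they would need converting to numbers first.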