Extracting data from XML into dictionary (1 line as key, next line as item)
I have an XML file with a header of summary information followed by a main data table. I've got the main data table out and into a pandas DataFrame, but now I want to extract parts of the header information into a dictionary.
Example of the XML file:
<Workbook>
  <Worksheet>
    <Tables>
      <Row>
        <Cell ss:StyleID="HeadTableTitle" ss:MergeAcross="1"><Data ss:Type="String">Administrative Data</Data></Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">ID</Data></Cell>
        <Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">B013</Data></Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">Title</Data></Cell>
        <Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">Mr</Data></Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">Last Name</Data></Cell>
        <Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">Data</Data></Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">First Name</Data></Cell>
        <Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">Test</Data></Cell>
      </Row>
      <Row/>
      <Row/>
      <Row>
        <Cell ss:StyleID="HeadTableTitle" ss:MergeAcross="1"><Data ss:Type="String">Biological and Medical Baseline Data</Data></Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">Height</Data></Cell>
        <Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">176 cm</Data></Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">Weight</Data></Cell>
        <Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">56.9 kg</Data></Cell>
      </Row>
    </Tables>
  </Worksheet>
</Workbook>
What I want to do is extract the data in the 'Administrative Data' section into a dictionary, with the first cell's value as the key (if I can remove the spaces, that would be awesome) and the second cell's value as the item. This then needs to be repeated so that the data in the 'Biological and Medical Baseline Data' section is held in a separate dictionary. Dictionary names can be whatever (e.g. 'subject' and 'biomed').
Current code to parse the XML file and access the tags:
from lxml import etree

f_path = 'data store/cortex_full.xml'  # enter path of xml file

# open and parse xml file
with open(f_path, 'r', encoding='utf-8') as f:  # set encoding to utf-8 for mac
    root = etree.parse(f)

namespaces = {'o': 'urn:schemas-microsoft-com:office:office',
              'x': 'urn:schemas-microsoft-com:office:excel',
              'ss': 'urn:schemas-microsoft-com:office:spreadsheet'}

ws = root.xpath('/ss:Workbook/ss:Worksheet', namespaces=namespaces)
if len(ws) > 0:
    tables = ws[0].xpath('./ss:Table', namespaces=namespaces)
    if len(tables) > 0:
        rows = tables[0].xpath('./ss:Row', namespaces=namespaces)
        for row in rows:
            cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)
Any suggestions on how to approach this? If a dictionary is not the best option, I'm also happy to work with other suggestions.
Make sure you declare the following before the loop:
subject={}
bio={}
d=None #If this doesn't work then use d={}
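On the `d=None` comment: it only works if a title row is always seen before the first name/value row. A minimal sketch of what happens otherwise, using one key/value pair from the sample (the `orphans` name is just an illustration, not part of the original code):

```python
d = None  # no section title seen yet

try:
    d["ID"] = "B013"  # what the loop would do for a name/value row
except TypeError as exc:
    message = str(exc)  # None does not support item assignment

# Safer default: an empty dict that absorbs rows seen before any title.
orphans = {}
d = orphans
d["ID"] = "B013"
```

With the dict default, stray rows end up in `orphans` instead of raising, and you can inspect them afterwards.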
And consider replacing
for row in rows:
    cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)
with
for row in rows:
    cells = row.xpath('./ss:Cell', namespaces=namespaces)
    if len(cells) == 2:
        # a name/value row: pick out key and item by the cell's StyleID
        key = None
        item = None
        for cell in cells:
            if cell.attrib['{urn:schemas-microsoft-com:office:spreadsheet}StyleID'] == "HeadTableParameterName":
                key = cell.xpath('./ss:Data', namespaces=namespaces)[0].text.strip()
            else:
                item = cell.xpath('./ss:Data', namespaces=namespaces)[0].text.strip()
        if key is not None and item is not None:
            d[key] = item
    elif len(cells) == 1:
        # a title row: switch the dictionary that the following rows go into
        if cells[0].attrib['{urn:schemas-microsoft-com:office:spreadsheet}StyleID'] == 'HeadTableTitle':
            if cells[0].xpath('./ss:Data', namespaces=namespaces)[0].text == 'Biological and Medical Baseline Data':
                d = bio
            else:
                d = subject

print(bio)
print(subject)
Although not strictly necessary, I've put in some checks just to give an idea; you can extend them to make the code more robust.
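For reference, here is a self-contained variant of the same idea that also removes the spaces from the keys, as asked in the question. It is only a sketch under a few assumptions: it uses the standard library's `xml.etree.ElementTree` rather than `lxml` so it runs without extra installs, the `SAMPLE` string is a trimmed, made-up stand-in for the real file (which is assumed to declare the spreadsheet namespace as its default), and `header_sections` is a hypothetical helper name. It looks for `Row` elements at any depth, so it does not depend on what the enclosing table element is called.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}

# Hypothetical, trimmed-down sample; namespace declarations are an assumption.
SAMPLE = f"""<Workbook xmlns="{SS}" xmlns:ss="{SS}">
 <Worksheet>
  <Table>
   <Row><Cell ss:StyleID="HeadTableTitle"><Data ss:Type="String">Administrative Data</Data></Cell></Row>
   <Row>
    <Cell ss:StyleID="HeadTableParameterName"><Data ss:Type="String">Last Name</Data></Cell>
    <Cell ss:StyleID="HeadTableParameterValue"><Data ss:Type="String">Data</Data></Cell>
   </Row>
   <Row/>
   <Row><Cell ss:StyleID="HeadTableTitle"><Data ss:Type="String">Biological and Medical Baseline Data</Data></Cell></Row>
   <Row>
    <Cell ss:StyleID="HeadTableParameterName"><Data ss:Type="String">Height</Data></Cell>
    <Cell ss:StyleID="HeadTableParameterValue"><Data ss:Type="String">176 cm</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>"""


def header_sections(root):
    """Return {section title: {key without spaces: value}}."""
    sections = {}
    current = None  # dict for the section whose title row was seen last
    for row in root.iterfind(".//ss:Row", NS):
        cells = row.findall("ss:Cell", NS)
        styles = [c.get(f"{{{SS}}}StyleID") for c in cells]
        texts = [(c.findtext("ss:Data", default="", namespaces=NS) or "").strip()
                 for c in cells]
        if styles == ["HeadTableTitle"]:
            # Title row: start (or reuse) the dictionary for this section.
            current = sections.setdefault(texts[0], {})
        elif (styles == ["HeadTableParameterName", "HeadTableParameterValue"]
              and current is not None):
            # Name/value row: drop spaces from the key, keep the value as text.
            current[texts[0].replace(" ", "")] = texts[1]
    return sections


sections = header_sections(ET.fromstring(SAMPLE))
subject = sections["Administrative Data"]
biomed = sections["Biological and Medical Baseline Data"]
```

Keying the outer dictionary by the section title means any further header sections are picked up without hard-coding their names; empty rows and rows with other cell layouts are simply skipped.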