Best way to extract data from XML using lxml
I need to parse dozens of continuously arriving XML files and pull a particular data set out of each one. Here is an example file:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<BPS Created="2020-04-03 09:16:11">
<Machine SerialNumber="2602" Site="" DPRelease="58.5" SoftwareRelease="4.0.3" VersionInfo="" Name="419ST39823" Type="BPS C2">
<Expected Currency="RUB" Value="0"/>
<ParameterSection Number="123456789" StartTime="" EndTime="" opmodename="01">
<Operator>123456789</Operator>
<HeadercardUnit HeaderCardID="" DepositID="123456789" denomvalue="5000" DeclaredDepositAmount="0" Currency="RUB" StartTime="2020-04-03 09:15:18" MilliSec="1" EndTime="2020-04-03 09:16:09" Rejects="YES">
<Counter Currency="RUB" DenomID="1353" Value="500" Quality="Acc" Issue="C" Output="Stacked" Number="17"/>
<Counter Currency="RUB" DenomID="1354" Value="1000" Quality="Acc" Issue="C" Output="Stacked" Number="31"/>
<Counter Currency="RUB" DenomID="1338" Value="1000" Quality="Acc" Issue="B" Output="Stacked" Number="3"/>
<Counter Currency="RUB" DenomID="1293" Value="2000" Quality="Acc" Issue="D" Output="Stacked" Number="5"/>
<Counter Currency="RUB" DenomID="1355" Value="5000" Quality="Acc" Issue="C" Output="Stacked" Number="27"/>
<Counter Currency="RUB" DenomID="1339" Value="5000" Quality="Acc" Issue="A" Output="Stacked" Number="5"/>
</HeadercardUnit>
</ParameterSection>
</Machine>
</BPS>
I use XPath to extract the values that I need:
serial = etree.XPath("/BPS/Machine/@SerialNumber")
control = etree.XPath("/BPS/Machine/ParameterSection/@Number")
oper = etree.XPath("/BPS/Machine/ParameterSection/Operator/text()")
dep_num = etree.XPath("/BPS/Machine/ParameterSection/HeadercardUnit/@DepositID")
dep_time = etree.XPath("/BPS/Machine/ParameterSection/HeadercardUnit/@StartTime")
counters = etree.XPath("/BPS/Machine/ParameterSection/HeadercardUnit/Counter")
Is this a good way to extract what I need, or should I treat each tag as an lxml Element and work with it directly? I suspect that the find function is slower than XPath.
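On the find versus XPath point, a small side-by-side may help. This is only a sketch: SAMPLE is an abbreviated copy of the file above, and extract_with_xpath and extract_with_find are hypothetical helper names, not part of lxml. It shows compiled etree.XPath objects being reused as callables and the same values being read with find() and get(). As far as I know, the lxml documentation describes the find*() methods as usually faster than full XPath for simple paths, so it is worth timing both on real files rather than assuming.

```python
from lxml import etree

# Abbreviated copy of the sample file (an assumption, to keep the sketch short)
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<BPS Created="2020-04-03 09:16:11">
<Machine SerialNumber="2602" Name="419ST39823" Type="BPS C2">
<ParameterSection Number="123456789" opmodename="01">
<Operator>123456789</Operator>
<HeadercardUnit DepositID="123456789" StartTime="2020-04-03 09:15:18">
<Counter Currency="RUB" DenomID="1353" Value="500" Number="17"/>
<Counter Currency="RUB" DenomID="1354" Value="1000" Number="31"/>
</HeadercardUnit>
</ParameterSection>
</Machine>
</BPS>"""

# Compile each expression once and reuse it for every incoming file
serial = etree.XPath("/BPS/Machine/@SerialNumber")
dep_time = etree.XPath("/BPS/Machine/ParameterSection/HeadercardUnit/@StartTime")


def extract_with_xpath(root):
    # A compiled XPath object is callable on an element or tree
    return {"serial": serial(root)[0], "dep_time": dep_time(root)[0]}


def extract_with_find(root):
    # root is the BPS element, so these paths are relative to it
    machine = root.find("Machine")
    unit = machine.find("ParameterSection/HeadercardUnit")
    return {"serial": machine.get("SerialNumber"),
            "dep_time": unit.get("StartTime")}


root = etree.fromstring(SAMPLE)
xpath_result = extract_with_xpath(root)
find_result = extract_with_find(root)
print(xpath_result == find_result)
```

Both helpers return the same dictionary for this sample, so the choice between them comes down to readability and measured speed.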
Based strictly on the XML in your question, I believe you are looking for something like this:
from lxml import etree
import pandas as pd

serial = """[your xml above]"""

# Parse bytes: lxml rejects a str that carries an XML encoding declaration
content = serial.encode('utf-8')
doc = etree.XML(content)

targets = doc.xpath('/BPS/Machine/ParameterSection')
rows = []
for target in targets:
    data = []  # one row per ParameterSection
    data.append(target.xpath("../@SerialNumber")[0])
    data.append(target.xpath("./@Number")[0])
    data.append(target.xpath("./Operator/text()")[0])
    data.append(target.xpath("./HeadercardUnit/@DepositID")[0])
    data.append(target.xpath("./HeadercardUnit/@StartTime")[0])
    counters = target.xpath("./HeadercardUnit/Counter")
    vals = []
    nums = []
    # Collect the Value and Number attributes of every Counter
    for counter in counters:
        vals.append(counter.xpath('./@Value')[0])
        nums.append(counter.xpath('./@Number')[0])
    data.append(vals)
    data.append(nums)
    rows.append(data)

columns = ['serial', 'control', 'oper', 'dep_num', 'dep_time', 'Value', 'Number']
pd.DataFrame(rows, columns=columns)
Output:
   serial    control       oper    dep_num             dep_time                                Value                 Number
0    2602  123456789  123456789  123456789  2020-04-03 09:15:18  [500, 1000, 1000, 2000, 5000, 5000]  [17, 31, 3, 5, 27, 5]
Obviously, you can play with the structure of the dataframe to adjust it to your needs.
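As one possible restructuring, here is a sketch that produces one flat row per Counter instead of lists packed into a single cell. It assumes the same element layout as the sample file. SAMPLE is an abbreviated copy of it and counter_rows is a hypothetical helper name. Converting Value and Number to int is also an assumption about how the data will be used.

```python
from lxml import etree

# Abbreviated copy of the sample file (an assumption, to keep the sketch short)
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<BPS Created="2020-04-03 09:16:11">
<Machine SerialNumber="2602" Name="419ST39823" Type="BPS C2">
<ParameterSection Number="123456789" opmodename="01">
<Operator>123456789</Operator>
<HeadercardUnit DepositID="123456789" StartTime="2020-04-03 09:15:18">
<Counter Currency="RUB" DenomID="1353" Value="500" Number="17"/>
<Counter Currency="RUB" DenomID="1354" Value="1000" Number="31"/>
</HeadercardUnit>
</ParameterSection>
</Machine>
</BPS>"""


def counter_rows(root):
    """Return one flat dict per Counter element under the BPS root."""
    rows = []
    for section in root.iterfind("Machine/ParameterSection"):
        machine = section.getparent()
        for unit in section.iterfind("HeadercardUnit"):
            for counter in unit.iterfind("Counter"):
                rows.append({
                    "serial": machine.get("SerialNumber"),
                    "control": section.get("Number"),
                    "oper": section.findtext("Operator"),
                    "dep_num": unit.get("DepositID"),
                    "dep_time": unit.get("StartTime"),
                    # Attributes arrive as strings; convert the numeric ones
                    "Value": int(counter.get("Value")),
                    "Number": int(counter.get("Number")),
                })
    return rows


rows = counter_rows(etree.fromstring(SAMPLE))
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame(rows) then gives a frame with one line per counter, which tends to be easier to group and sum than list-valued cells.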