
Best way to extract data from XML using lxml

I need to parse dozens of continuously arriving XML files and pull a particular set of data out of each one. Here is an example file:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<BPS Created="2020-04-03 09:16:11">
  <Machine SerialNumber="2602" Site="" DPRelease="58.5" SoftwareRelease="4.0.3" VersionInfo="" Name="419ST39823" Type="BPS C2">
    <Expected Currency="RUB" Value="0"/>
    <ParameterSection Number="123456789" StartTime="" EndTime="" opmodename="01">
      <Operator>123456789</Operator>
      <HeadercardUnit HeaderCardID="" DepositID="123456789" denomvalue="5000" DeclaredDepositAmount="0" Currency="RUB" StartTime="2020-04-03 09:15:18" MilliSec="1" EndTime="2020-04-03 09:16:09" Rejects="YES">
        <Counter Currency="RUB" DenomID="1353" Value="500" Quality="Acc" Issue="C" Output="Stacked" Number="17"/>
        <Counter Currency="RUB" DenomID="1354" Value="1000" Quality="Acc" Issue="C" Output="Stacked" Number="31"/>
        <Counter Currency="RUB" DenomID="1338" Value="1000" Quality="Acc" Issue="B" Output="Stacked" Number="3"/>
        <Counter Currency="RUB" DenomID="1293" Value="2000" Quality="Acc" Issue="D" Output="Stacked" Number="5"/>
        <Counter Currency="RUB" DenomID="1355" Value="5000" Quality="Acc" Issue="C" Output="Stacked" Number="27"/>
        <Counter Currency="RUB" DenomID="1339" Value="5000" Quality="Acc" Issue="A" Output="Stacked" Number="5"/>
      </HeadercardUnit>
    </ParameterSection>
  </Machine>
</BPS>

I use XPath to extract the values I need:

from lxml import etree

# Precompiled XPath evaluators, one per field of interest
serial = etree.XPath("/BPS/Machine/@SerialNumber")
control = etree.XPath("/BPS/Machine/ParameterSection/@Number")
oper = etree.XPath("/BPS/Machine/ParameterSection/Operator/text()")
dep_num = etree.XPath("/BPS/Machine/ParameterSection/HeadercardUnit/@DepositID")
dep_time = etree.XPath("/BPS/Machine/ParameterSection/HeadercardUnit/@StartTime")
counters = etree.XPath("/BPS/Machine/ParameterSection/HeadercardUnit/Counter")

Is this a good way to extract what I need? Or should I treat each tag as an lxml Element and work with it directly? I suspect that using the find function is slower than XPath.
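To make the two options in the question concrete, here is a minimal sketch that applies precompiled etree.XPath evaluators to a parsed document and then reads the same fields with find/findall/get. The SAMPLE bytes are a shortened copy of the file above, and the helper names extract_with_xpath and extract_with_find are made up for this illustration.

```python
from lxml import etree

# Shortened copy of the sample file, embedded as bytes for illustration.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<BPS Created="2020-04-03 09:16:11">
  <Machine SerialNumber="2602" Name="419ST39823" Type="BPS C2">
    <ParameterSection Number="123456789" opmodename="01">
      <Operator>123456789</Operator>
      <HeadercardUnit DepositID="123456789" StartTime="2020-04-03 09:15:18">
        <Counter Currency="RUB" DenomID="1353" Value="500" Number="17"/>
        <Counter Currency="RUB" DenomID="1354" Value="1000" Number="31"/>
      </HeadercardUnit>
    </ParameterSection>
  </Machine>
</BPS>"""

# Compile once; the evaluators can then be reused for every incoming file.
serial = etree.XPath("/BPS/Machine/@SerialNumber")
counters = etree.XPath("/BPS/Machine/ParameterSection/HeadercardUnit/Counter")


def extract_with_xpath(root):
    """Apply the precompiled XPath evaluators to one parsed document."""
    return {
        "serial": serial(root)[0],
        "counters": [(c.get("Value"), c.get("Number")) for c in counters(root)],
    }


def extract_with_find(root):
    """Read the same fields through find/findall/get."""
    machine = root.find("Machine")  # root is <BPS>, so paths are relative to it
    found = machine.findall("ParameterSection/HeadercardUnit/Counter")
    return {
        "serial": machine.get("SerialNumber"),
        "counters": [(c.get("Value"), c.get("Number")) for c in found],
    }


root = etree.fromstring(SAMPLE)
via_xpath = extract_with_xpath(root)
via_find = extract_with_find(root)
print(via_xpath == via_find)
```

Both helpers return the same data for this sample. Which one is faster depends on the documents and the paths involved, so timing both on real files (for example with timeit) is the safest way to decide.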

Based strictly on the XML in your question, I believe you are looking for something like this:

from lxml import etree
import pandas as pd

xml_text = """[your xml above]"""

# Parse from bytes: lxml rejects str input that carries an encoding declaration.
doc = etree.XML(xml_text.encode('utf-8'))

columns = ['serial', 'control', 'oper', 'dep_num', 'dep_time', 'Value', 'Number']
rows = []
# Build one row per ParameterSection so several sections do not get mixed together.
for target in doc.xpath('/BPS/Machine/ParameterSection'):
    counters = target.xpath('./HeadercardUnit/Counter')
    rows.append([
        target.xpath('../@SerialNumber')[0],
        target.xpath('./@Number')[0],
        target.xpath('./Operator/text()')[0],
        target.xpath('./HeadercardUnit/@DepositID')[0],
        target.xpath('./HeadercardUnit/@StartTime')[0],
        [counter.get('Value') for counter in counters],   # Value attribute of each Counter
        [counter.get('Number') for counter in counters],  # Number attribute of each Counter
    ])

df = pd.DataFrame(rows, columns=columns)
print(df)

Output:

   serial    control       oper    dep_num             dep_time                                Value                 Number
0    2602  123456789  123456789  123456789  2020-04-03 09:15:18  [500, 1000, 1000, 2000, 5000, 5000]  [17, 31, 3, 5, 27, 5]

Obviously, you can adjust the structure of the DataFrame to suit your needs.
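As one possible restructuring, the sketch below produces a "long" table with one row per Counter and numeric Value and Number columns, which tends to be easier to aggregate than lists stored in cells. It uses a shortened copy of the sample file; the layout is only a suggestion, not something the question asks for.

```python
from lxml import etree
import pandas as pd

# Shortened copy of the sample file, embedded as bytes for illustration.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<BPS Created="2020-04-03 09:16:11">
  <Machine SerialNumber="2602" Name="419ST39823" Type="BPS C2">
    <ParameterSection Number="123456789" opmodename="01">
      <Operator>123456789</Operator>
      <HeadercardUnit DepositID="123456789" StartTime="2020-04-03 09:15:18">
        <Counter Currency="RUB" DenomID="1353" Value="500" Number="17"/>
        <Counter Currency="RUB" DenomID="1354" Value="1000" Number="31"/>
      </HeadercardUnit>
    </ParameterSection>
  </Machine>
</BPS>"""

root = etree.fromstring(SAMPLE)

rows = []
for unit in root.iterfind("Machine/ParameterSection/HeadercardUnit"):
    section = unit.getparent()     # enclosing <ParameterSection>
    machine = section.getparent()  # enclosing <Machine>
    for counter in unit.iterfind("Counter"):
        # One row per Counter, repeating the fields of its parent elements.
        rows.append({
            "serial": machine.get("SerialNumber"),
            "control": section.get("Number"),
            "oper": section.findtext("Operator"),
            "dep_num": unit.get("DepositID"),
            "dep_time": unit.get("StartTime"),
            "Value": int(counter.get("Value")),
            "Number": int(counter.get("Number")),
        })

df = pd.DataFrame(rows)
df["dep_time"] = pd.to_datetime(df["dep_time"])  # parse timestamps into datetime64
print(df)
```

With this layout, per-deposit summaries become ordinary column operations, for example multiplying Value by Number and grouping by dep_num.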
