Parsing multiple nested XML elements into a pandas DataFrame in Python

Question

<?xml version='1.0' encoding='UTF-8' ?>
  <DOC>
    <INFO1
      A = "1"
      B = "2"
      C = "3"
    >
      <INFO12
        D = "a"
      >
      </INFO12>
    </INFO1>
    <INFO2
      E = "4"
      F = "5"
      G = "6"
    >
      <INFO21
        H = "b"
      >
      </INFO21>
    </INFO2>
 </DOC>

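import pandas as pd
import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

TestFile = "test.xml"
ttree = etree.parse(TestFile)
troot = ttree.getroot()
df_cols = ["Col1", "Col2", "Col3", "Col4", "Col5", "Col6",
           "Col7", "Col8"]
df = pd.DataFrame(columns=df_cols)

# Note: DataFrame.append needs pandas < 2.0 (it was removed in 2.0).
# Iterating over troot visits only the direct children (INFO1, INFO2),
# so D and H, which live on the nested elements, come back as None.
for i in troot:
    df = df.append(pd.Series([i.get('A'), i.get('B'), i.get('C'), i.get('D'),
                              i.get('E'), i.get('F'), i.get('G'), i.get('H')],
                             index=df_cols), ignore_index=True)

df.head()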

https://i.stack.imgur.com/pvBnm.png

Question: I'm trying to parse XML into a data frame in Python using the xml.etree.cElementTree library. How can I get the result in a single row that also includes a and b, so that it reads '1, 2, 3, a, 4, 5, 6, b'? Thank you!

Answer 1

You can use another library, for example simplified_scrapy:

from simplified_scrapy import SimplifiedDoc

html = '''
<?xml version='1.0' encoding='UTF-8' ?>
  <DOC>
    <INFO1
      A = "1"
      B = "2"
      C = "3"
    >
      <INFO12
        D = "a"
      >
      </INFO12>
    </INFO1>
    <INFO2
      E = "4"
      F = "5"
      G = "6"
    >
      <INFO21
        H = "b"
      >
      </INFO21>
    </INFO2>
 </DOC>
'''
doc = SimplifiedDoc(html)
infos = doc.DOC.children
row = [infos[0].A,infos[0].B,infos[0].C,infos[0].child.D,
    infos[1].E,infos[1].F,infos[1].G,infos[1].child.H]
print (row)

Result:

['1', '2', '3', 'a', '4', '5', '6', 'b']
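
If you would rather stay with the standard library, the same single row can probably be built with xml.etree.ElementTree and pandas alone. The following is only a minimal sketch, not the asker's or the answerer's code: XML_TEXT and row are illustrative names, the XML is inlined without its declaration for brevity, and it assumes attribute names are unique across elements (a repeated name would overwrite the earlier value). It relies on root.iter(), which walks every element in document order, including the nested ones.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Same structure as the XML in the question (declaration omitted,
# attributes written on one line for brevity).
XML_TEXT = """
<DOC>
  <INFO1 A="1" B="2" C="3">
    <INFO12 D="a"></INFO12>
  </INFO1>
  <INFO2 E="4" F="5" G="6">
    <INFO21 H="b"></INFO21>
  </INFO2>
</DOC>
"""

root = ET.fromstring(XML_TEXT)

# root.iter() visits the root and all descendants in document order,
# so INFO12 and INFO21 are reached right after their parents.
# Assumption: no attribute name is repeated, otherwise update() overwrites it.
row = {}
for elem in root.iter():
    row.update(elem.attrib)

# A list holding one dict yields a DataFrame with exactly one row.
df = pd.DataFrame([row])
print(df)
```

To read from a file instead, ET.parse("test.xml").getroot() can replace ET.fromstring(XML_TEXT). The columns are named after the attributes (A to H); df.columns can be reassigned if names such as Col1 to Col8 are preferred.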
