Transforming xml to json with python lxml

Basically, I want to transform an xml document to json using python3 and the lxml library. The important thing here is that I want to preserve all text, tails, tags and the order of the xml. Below is an example of what my program should be able to do:

What I have

<root>
   <tag>
      Some tag-text<subtag>Some subtag-text</subtag> Some tail-text
   </tag>
</root>

What I want (python dict/json)

{
  "root":{
    "tag":[
        {"text":"Some tag-text"},
        {"subtag":{"text":"Some subtag-text"}},
        {"text":"Some tail-text"}
      ]
  }
}

This is just a very simplified example. The files I need to transform are much bigger and more deeply nested.

Also, I can't use the xmltodict library for this, only lxml.

I'm almost 99% sure there is some elegant way to do this recursively, but so far I haven't been able to write a solution that works the way I want it to.

Thanks a lot for the help

EDIT: Why this question is not a duplicate of "Converting XML to JSON using Python?"

I understand there is no such thing as a one-to-one mapping from xml to json. I'm specifically asking for a way that preserves the text order, as in the example above.

Also, using xmltodict doesn't achieve that goal. For example, transforming the xml from the example above with xmltodict results in the following structure:

root:
    tag:
        text: 'Some tag-text Some tail-text'
        subtag: 'Some subtag-text'

You can see that the tail part "Some tail-text" was concatenated with "Some tag-text".
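To illustrate the distinction being asked about, here is a minimal sketch of how the parsed tree keeps the two strings apart: the text before `<subtag>` is stored in `tag.text`, and the text after `</subtag>` is stored in `subtag.tail`. The variable names are only illustrative, and the fallback import is an assumption made so the snippet also runs where lxml is not installed (the standard library's ElementTree exposes the same `.text` and `.tail` attributes).

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib stand-in, used only when lxml is unavailable.
    import xml.etree.ElementTree as etree

XML = """<root>
   <tag>
      Some tag-text<subtag>Some subtag-text</subtag> Some tail-text
   </tag>
</root>"""

root = etree.fromstring(XML)
tag = root[0]
subtag = tag[0]

# Text before the first child belongs to the parent element.
tag_text = tag.text.strip()
# Text after a child's closing tag is stored on that child as its tail.
subtag_tail = subtag.tail.strip()

print(repr(tag_text))
print(repr(subtag.text))
print(repr(subtag_tail))
```

Since both pieces are available separately, a conversion that walks the children in order should be able to keep them apart.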

Thanks

I think if you need to preserve document order (what you referred to as "text-order"), XSLT is a good option. XSLT can output plain text which can be loaded as json. Luckily lxml supports XSLT 1.0.

Example...

XML Input (input.xml)

<root>
    <tag>
        Some tag-text<subtag>Some subtag-text</subtag> Some tail-text
    </tag>
</root>

XSLT 1.0 (xml2json.xsl)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="*">
    <xsl:if test="position() != 1">, </xsl:if>
    <xsl:value-of select="concat('{&quot;',
      local-name(),
      '&quot;: ')"/>
    <xsl:choose>
      <xsl:when test="count(node()) > 1">
        <xsl:text>[</xsl:text>
        <xsl:apply-templates/>
        <xsl:text>]</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates/>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:text>}</xsl:text>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:if test="position() != 1">, </xsl:if>
    <xsl:value-of select="concat('{&quot;text&quot;: &quot;', 
      normalize-space(), 
      '&quot;}')"/>
  </xsl:template>

</xsl:stylesheet>

Python

import json
from lxml import etree

tree = etree.parse("input.xml")

xslt_root = etree.parse("xml2json.xsl")
transform = etree.XSLT(xslt_root)

result = transform(tree)

json_load = json.loads(str(result))

json_dump = json.dumps(json_load, indent=2)

print(json_dump)

For informational purposes, the output of the XSLT (result) is:

{"root": {"tag": [{"text": "Some tag-text"}, {"subtag": {"text": "Some subtag-text"}}, {"text": "Some tail-text"}]}}

The printed output from Python (after loads()/dumps()) is:

{
  "root": {
    "tag": [
      {
        "text": "Some tag-text"
      },
      {
        "subtag": {
          "text": "Some subtag-text"
        }
      },
      {
        "text": "Some tail-text"
      }
    ]
  }
}

Here's an alternative to @Daniel Haley's solution:

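def recu(root):
    # Collect text, child elements and tails in document order.
    my = []
    # Text that precedes the first child element.
    if root.text:
        my.append({"text": root.text})
    for elem in root:
        my.append(recu(elem))
        # Text that follows this child's closing tag.
        if elem.tail:
            my.append({"text": elem.tail})
    # Unwrap the list when the element holds a single item.
    my = my[0] if len(my) == 1 else my
    return {root.tag: my}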
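Note that the function above keeps whitespace exactly as it appears in the source, so for the indented example the text values would carry newlines and the whitespace-only text of `root` would show up as its own entry. A minimal sketch of a variant that strips each string and skips empty ones, assuming whitespace-only text can be dropped (`element_to_dict` is a hypothetical name, and the fallback import is only there so the snippet runs without lxml):

```python
import json

try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib stand-in, used only when lxml is unavailable.
    import xml.etree.ElementTree as etree


def element_to_dict(elem):
    """Hypothetical variant of recu() that drops whitespace-only text."""
    items = []
    text = (elem.text or "").strip()
    if text:
        items.append({"text": text})
    for child in elem:
        items.append(element_to_dict(child))
        # The tail is the text between this child and the next sibling.
        tail = (child.tail or "").strip()
        if tail:
            items.append({"text": tail})
    # Same convention as recu(): a single item is not wrapped in a list.
    return {elem.tag: items[0] if len(items) == 1 else items}


XML = """<root>
   <tag>
      Some tag-text<subtag>Some subtag-text</subtag> Some tail-text
   </tag>
</root>"""

result = element_to_dict(etree.fromstring(XML))
print(json.dumps(result, indent=2))
```

One consequence of unwrapping single items is that an element's value is a dict when it holds one item and a list when it holds several, so code that consumes the result has to handle both shapes.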
