
Itemizing big XML files in Python

I am designing an ETL pipeline in which the first step is to split the input XML dataset into individual XML files, one per item. The input datasets are essentially exports of metadata under specific models (the current example is EDM). I am fairly comfortable with XSLT and was hoping to use it to keep the Python side small, since the task is supposedly not that complex.

I have browsed many threads, including Liza Daly's fast_iter (cf. https://www.ibm.com/developerworks/xml/library/x-hiperfparse/ ). I tried different approaches, but I always end up stuck when writing the files (either no output or serialization issues). I would appreciate some seasoned feedback.

Dataset structure

<rdf:RDF ...many namespaces...>
    <!--ITEM1 NODE-->
    <ore:aggregates>
        <edm:ProvidedCHO rdf:about="http://some/url"/>
        <ore:Aggregation rdf:about="http://some/url">
            <...>
        </ore:Aggregation>
        <ore:Proxy rdf:about="http://some/url">
            <...>
        </ore:Proxy>
        <edm:EuropeanaAggregation rdf:about="http://some/url">
            <...>      
        </edm:EuropeanaAggregation>
    </ore:aggregates>

    <!--ITEM2 NODE-->
    <ore:aggregates>
        <...>      
    </ore:aggregates>

    <!--ITEM3 NODE-->
    <ore:aggregates>
        <...>      
    </ore:aggregates>
</rdf:RDF>

Expected result

<!--ITEM 1-->
<rdf:RDF ...many namespaces...>
    <edm:ProvidedCHO rdf:about="http://some/url"/>
    <ore:Aggregation rdf:about="http://some/url">
        <...>
    </ore:Aggregation>
    <ore:Proxy rdf:about="http://some/url">
        <...>
    </ore:Proxy>
    <edm:EuropeanaAggregation rdf:about="http://some/url">
        <...>      
    </edm:EuropeanaAggregation>
</rdf:RDF>


CURRENT TRYOUTS

Trying to use lxml to apply an itemizing XSLT in one pass (script + XSLT)

from lxml import etree as ET

dom = ET.parse(source)
xslt = ET.parse(xsl_filename)
transform = ET.XSLT(xslt)
newdom = transform(dom)
print(ET.tostring(newdom, pretty_print=True))

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet exclude-result-prefixes="xsi xlink xml" version="2.0"
    xmlns:many="namespaces">

    <xsl:output encoding="UTF-8" indent="yes"/>

    <!--<xsl:param name="output" select="'/Users/yep/Code/+dev/test data/output/'"/>-->
    <xsl:param name="output" select="'/home/yep/data/split/'"/>
    <xsl:param name="children" select="/rdf:RDF/ore:aggregates"/>

    <!-- ROOT MATCH -->
    <xsl:template match="/">
        <xsl:for-each select="$children">
            <xsl:call-template name="itemize"/>
        </xsl:for-each>
    </xsl:template>

    <xsl:template name="itemize">

            <xsl:variable name="uri" select="translate(ore:Proxy/dc:identifier, ' ', '_')"/>
            <xsl:variable name="ns"/>
            <xsl:variable name="fullOutput" select="concat($output, $uri)"/>
            <xsl:result-document href="{$fullOutput}.xml" method="xml">
                <xsl:element name="rdf:RDF">
                    <xsl:copy-of select="namespace::*"/>
                    <xsl:copy-of select="*"/>
                </xsl:element>
            </xsl:result-document>
    </xsl:template>

</xsl:stylesheet>

... no output. I also tried 'write', but that did not work either.

Trying via ElementTree

import json
import xml.etree.ElementTree as ET

root = ET.parse(source).getroot()

# namespaces variable generated from a json file
jsonFile = open("application/models/namespaces.json")
jsonStr = jsonFile.read()
namespaces = json.loads(jsonStr)

for item in root.findall("ore:aggregates", namespaces):
    newTree = ET.parse("/home/yep/application/services/create/sample.xml")
    newroot = newTree.getroot()

    for node in item.findall("edm:ProvidedCHO", namespaces):
        newroot.append(node)
        ET.SubElement(newroot, node)

    filename = "/home/yep/data/split/" + str(i) + ".xml"
    newTree.write(filename)

TypeError: cannot serialize <Element '{http://www.europeana.eu/schemas/edm/}ProvidedCHO' at 0x7f4768a03688> (type Element)

I think the issue is that I am not handling namespaces properly, or maybe that I am still applying an XSLT mindset to the data when this is Python. Some help would be appreciated :)
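The TypeError in the ElementTree attempt is consistent with the `ET.SubElement(newroot, node)` call: `SubElement` expects a tag string as its second argument, so passing an existing Element creates a child whose tag is itself an Element, which cannot be serialized. A minimal sketch of the same split using only `append`/`extend` follows. The namespace URIs are placeholders standing in for the real ones from namespaces.json, and `split_items` is a hypothetical helper name, not something from the original post.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumption): replace with the real ones.
NAMESPACES = {
    "rdf": "http://some_rdf_uri",
    "ore": "http://some_ore_uri",
    "edm": "http://some_edm_uri",
}


def split_items(root):
    """Return one new ElementTree per ore:aggregates child of root."""
    # Register the prefixes so output uses rdf:/ore:/edm: rather than ns0:, ns1:, ...
    for prefix, uri in NAMESPACES.items():
        ET.register_namespace(prefix, uri)

    trees = []
    for item in root.findall("ore:aggregates", NAMESPACES):
        # New root with the same tag and attributes as the original root.
        new_root = ET.Element(root.tag, root.attrib)
        # Attach the existing child elements directly; no SubElement() call,
        # because SubElement() takes a tag string, not an Element.
        new_root.extend(list(item))
        trees.append(ET.ElementTree(new_root))
    return trees
```

Each returned tree can then be written with something like `tree.write(filename, encoding="utf-8", xml_declaration=True)`. Note that this version still loads the whole input into memory, so it only addresses the serialization error, not the size of the dataset.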

Since you're processing XSLT with lxml, you're limited to XSLT 1.0. Since 1.0 doesn't support xsl:result-document, you'll have to use the EXSLT exsl:document extension element (which, luckily, lxml supports).

Here's an example...

XML Input (test.xml)

<rdf:RDF xmlns:rdf="http://some_rdf_uri" xmlns:edm="http://some_edm_uri" xmlns:ore="http://some_ore_uri">
    <!--ITEM1 NODE-->
    <ore:aggregates>
        <edm:ProvidedCHO rdf:about="http://some/url">from item1</edm:ProvidedCHO>
        <ore:Aggregation rdf:about="http://some/url">from item1</ore:Aggregation>
        <ore:Proxy rdf:about="http://some/url">from item1</ore:Proxy>
        <edm:EuropeanaAggregation rdf:about="http://some/url">from item1</edm:EuropeanaAggregation>
    </ore:aggregates>

    <!--ITEM2 NODE-->
    <ore:aggregates>
        <edm:ProvidedCHO rdf:about="http://some/url">from item2</edm:ProvidedCHO>
        <ore:Aggregation rdf:about="http://some/url">from item2</ore:Aggregation>
        <ore:Proxy rdf:about="http://some/url">from item2</ore:Proxy>
        <edm:EuropeanaAggregation rdf:about="http://some/url">from item2</edm:EuropeanaAggregation>
    </ore:aggregates>

    <!--ITEM3 NODE-->
    <ore:aggregates>
        <edm:ProvidedCHO rdf:about="http://some/url">from item3</edm:ProvidedCHO>
        <ore:Aggregation rdf:about="http://some/url">from item3</ore:Aggregation>
        <ore:Proxy rdf:about="http://some/url">from item3</ore:Proxy>
        <edm:EuropeanaAggregation rdf:about="http://some/url">from item3</edm:EuropeanaAggregation>
    </ore:aggregates>
</rdf:RDF>

XSLT 1.0 (test.xsl)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:exsl="http://exslt.org/common"
  extension-element-prefixes="exsl">
  <xsl:strip-space elements="*"/>

  <xsl:template match="/*/*">
    <xsl:apply-templates select=".." mode="copy">
      <xsl:with-param name="target_id" select="generate-id()"/>
    </xsl:apply-templates>
  </xsl:template>

  <xsl:template match="/*" mode="copy">
    <xsl:param name="target_id"/>
    <exsl:document href="{$target_id}.xml" indent="yes">
      <xsl:copy>
        <xsl:copy-of select="@*|*[generate-id()=$target_id]/*"/>
      </xsl:copy>      
    </exsl:document>
  </xsl:template>

</xsl:stylesheet>

Python

from lxml import etree

tree = etree.parse("test.xml")
xslt = etree.parse("test.xsl")

tree.xslt(xslt)

Output (The filenames are based on the generated IDs, so they will probably differ when you run this code.)

idm253366124.xml

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://some_rdf_uri" xmlns:edm="http://some_edm_uri" xmlns:ore="http://some_ore_uri">
  <edm:ProvidedCHO rdf:about="http://some/url">from item1</edm:ProvidedCHO>
  <ore:Aggregation rdf:about="http://some/url">from item1</ore:Aggregation>
  <ore:Proxy rdf:about="http://some/url">from item1</ore:Proxy>
  <edm:EuropeanaAggregation rdf:about="http://some/url">from item1</edm:EuropeanaAggregation>
</rdf:RDF>

idm219411756.xml

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://some_rdf_uri" xmlns:edm="http://some_edm_uri" xmlns:ore="http://some_ore_uri">
  <edm:ProvidedCHO rdf:about="http://some/url">from item2</edm:ProvidedCHO>
  <ore:Aggregation rdf:about="http://some/url">from item2</ore:Aggregation>
  <ore:Proxy rdf:about="http://some/url">from item2</ore:Proxy>
  <edm:EuropeanaAggregation rdf:about="http://some/url">from item2</edm:EuropeanaAggregation>
</rdf:RDF>

idm219410244.xml

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://some_rdf_uri" xmlns:edm="http://some_edm_uri" xmlns:ore="http://some_ore_uri">
  <edm:ProvidedCHO rdf:about="http://some/url">from item3</edm:ProvidedCHO>
  <ore:Aggregation rdf:about="http://some/url">from item3</ore:Aggregation>
  <ore:Proxy rdf:about="http://some/url">from item3</ore:Proxy>
  <edm:EuropeanaAggregation rdf:about="http://some/url">from item3</edm:EuropeanaAggregation>
</rdf:RDF>

Update for dynamic path...

XSLT

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:rdf="http://some_rdf_uri" xmlns:edm="http://some_edm_uri" 
  xmlns:ore="http://some_ore_uri"
  xmlns:exsl="http://exslt.org/common"
  extension-element-prefixes="exsl">
  <xsl:strip-space elements="*"/>

  <xsl:key name="elem_by_id" match="*" use="generate-id()"/>

  <xsl:template match="/*" name="root">
    <xsl:apply-templates select="*"/>
  </xsl:template>

  <xsl:template match="*">
    <xsl:apply-templates select="/*" mode="copy">
      <xsl:with-param name="target_id" select="generate-id()"/>
    </xsl:apply-templates>
  </xsl:template>

  <xsl:template match="/*" mode="copy">
    <xsl:param name="target_id"/>
    <exsl:document href="temp/{$target_id}.xml" indent="yes">
      <xsl:copy>
        <xsl:copy-of select="@*|key('elem_by_id',$target_id)/*"/>
      </xsl:copy>      
    </exsl:document>
  </xsl:template>

</xsl:stylesheet>

Python

from lxml import etree

tree = etree.parse("test.xml")
xslt = etree.parse("test.xsl")

target_path = "/rdf:RDF/ore:aggregates"

try:
    elem = xslt.xpath("/xsl:stylesheet/xsl:template[@name='root']/xsl:apply-templates",
                      namespaces={"xsl": "http://www.w3.org/1999/XSL/Transform"})[0]
    elem.attrib["select"] = target_path
except IndexError:
    print("Could not find xsl:template to update.")

tree.xslt(xslt)

Alternatively, consider passing a parameter from Python to XSLT with lxml, iterating over the items and creating a separate XML file for each ore:aggregates element according to its position() number:

XSLT

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                              xmlns:rdf="rdf.com" 
                              xmlns:ore="ore.com" 
                              xmlns:edm="edm.com">
    <xsl:strip-space elements="*"/>
    <xsl:output indent="yes"/>

    <!-- XSL PARAM -->
    <xsl:param name="item_num"/>

    <!-- IDENTITY TRANSFORM -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- EMPTY TEMPLATE TO REMOVE NON-SELECTED ITEMS -->        
    <xsl:template match="ore:aggregates[position()!=$item_num]"/>

    <xsl:template match="comment()"/>
</xsl:stylesheet>

Python

import lxml.etree as et

# LOAD XML AND XSL SCRIPT
ns = {"ore": "ore.com"}                    # ORE NAMESPACE
xml = et.parse('/path/to/input/xml')
xsl = et.parse('/path/to/XSLT/script.xsl')
transform = et.XSLT(xsl)

# LOOP THROUGH ALL ITEM POSITIONS AND PASS EACH ONE AS A PARAMETER TO XSLT
ore_agg_count = len(xml.xpath('//ore:aggregates', namespaces=ns))
for i in range(1, ore_agg_count + 1):      # XPATH position() IS 1-BASED
    n = et.XSLT.strparam(str(i))           # VALUE OF THE item_num XSL PARAMETER
    result = transform(xml, item_num=n)

    # SAVE XML TO FILE
    with open('ore_aggregates_{}.xml'.format(i), 'wb') as f:
        f.write(result)
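
Both approaches above parse the whole input before splitting it, and the parameter-based one re-runs the transform once per item. For genuinely large exports, a streaming split in the spirit of the fast_iter idea mentioned in the question may be worth considering. The following is only a sketch under assumptions: it uses the standard library's `xml.etree.ElementTree.iterparse` rather than lxml, `iter_items` is a hypothetical helper name, and it assumes the items are direct children of the root element.

```python
import xml.etree.ElementTree as ET


def iter_items(source, item_tag):
    """Stream through source and yield one standalone tree per item_tag element.

    item_tag is in Clark notation, e.g. "{http://some_ore_uri}aggregates".
    """
    root = None
    root_tag = None
    root_attrib = {}
    events = ("start", "start-ns", "end")
    for event, payload in ET.iterparse(source, events=events):
        if event == "start-ns":
            prefix, uri = payload
            if prefix:
                # Keep the input's prefixes when the pieces are serialized.
                ET.register_namespace(prefix, uri)
        elif event == "start" and root is None:
            # The first "start" event is the document root.
            root = payload
            root_tag = payload.tag
            root_attrib = dict(payload.attrib)
        elif event == "end" and payload.tag == item_tag:
            new_root = ET.Element(root_tag, root_attrib)
            new_root.extend(list(payload))
            yield ET.ElementTree(new_root)
            # Detach already-processed items from the root to limit memory use.
            root.clear()
```

A caller would loop over `iter_items(path, "{...}aggregates")` and call `tree.write(...)` on each result with whatever file naming scheme fits, for example a counter or the item's identifier.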
