
How to modify the text of nested elements in an XML file using Python?

I'm currently working on a corpus/dataset in XML format, as shown in the picture below. I want to visit every 'ne' element one by one, read the text of the 'W' elements inside it, and concatenate the symbols 'SDi' and 'EDi' with that text. Here 'i' is a positive whole number starting from 1. For 'SDi' I need only the text of the first 'W' element inside the 'ne' element; for 'EDi' I need only the text of the last 'W' element inside the 'ne' element.

Currently I get no output after running the code. I think this is because the 'W' element is never accessed. I suspect that happens because 'W' is a grandchild of 'ne', so it cannot be accessed directly and may only be reachable through its parent node.

Note 1: The number and names of the sub-elements inside the 'ne' elements are not always the same.

Note 2: Only what is needed is explained here. You may find other details in the code/picture; please ignore them.

I'm using Spyder (Python 3.6). Any help would be appreciated.

A picture of the XML file I'm working on is given below: [image]

Text version of the XML file: Click here

Sample/expected output image (below): [image]

The code I've written so far:

for i in range(len(List_of_root_nodes)):
    true_false = True
    current = List_of_root_nodes[i]
    start_ID = current.PDante_ID
    #print('start:', start_ID)  # For Testing
    end_ID = None
    number = str(i+1)  # This number serves as the i in SDi and EDi

    discourse_starting_symbol = "SD" + number
    discourse_ending_symbol = "ED" + number

    while true_false:
        if current.right_child is None:
            end_ID = current.PDante_ID
            #print('end:', end_ID)  # For Testing
            true_false = False
        else:
            current = current.right_child

    # Finding 'ne' element with id='start_ID'
    ne_text = None
    ne_id = None

    for ne in myroot.iter('ne'):
        ne_id = ne.get('id')

        # If ne_id matches start_ID, the place where SDi is to be placed is found
        if ne_id == start_ID:
            for w in ne.iter('W'):
                ne_text = str(w.text)
                boundary_and_text = " " + str(discourse_starting_symbol) + " " + ne_text
                w.text = boundary_and_text
                break

        # If ne_id matches end_ID, the place where EDi is to be placed is found

        # Some changes required here: 'EDi' needs to be placed after the last 'W' element,
        # so the last 'W' element needs to be accessed
        if ne_id == end_ID:
            for w in ne.iter('W'):
                ne_text = str(w.text)
                boundary_and_text = ne_text + " " + str(discourse_ending_symbol) + " "
                w.text = boundary_and_text
                break
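
On the grandchild concern: in the standard library, `Element.iter('W')` walks every descendant in document order, not only direct children, while `findall('W')` with a bare tag name matches direct children only. A minimal sketch on a made-up fragment follows. The `ne` and `W` names come from the question; the `np` wrapper, the id and the words are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the W elements sit two levels below ne.
SAMPLE = """
<doc>
  <ne id="ne_1">
    <np><W>alpha</W><W>beta</W></np>
    <np><W>gamma</W></np>
  </ne>
</doc>
"""

root = ET.fromstring(SAMPLE)
ne = root.find(".//ne")

direct_children = ne.findall("W")   # bare tag name: direct children only
descendants = list(ne.iter("W"))    # iter: all descendants, document order

# Materialising the iterator as a list makes the last W easy to reach.
first_w, last_w = descendants[0], descendants[-1]
print(len(direct_children), first_w.text, last_w.text)
```

If the real file behaves the same way, an empty result more likely points at the id comparison than at the nesting: `ne.get('id')` returns a string (or `None`), so if `start_ID` happens to be, say, an integer, `ne_id == start_ID` never matches.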

Whenever you need to modify XML with various nuanced needs, consider XSLT, the special-purpose language designed to transform XML files. You can run XSLT 1.0 scripts with Python's third-party module, lxml (not the built-in etree).

Specifically, use the identity transform to copy the XML as is, then add two templates: one that adds SDI to the text of the very first <W> and one that adds EDI to the text of the very last <W>. The solution works whether there are 10 or 10,000 <W> nodes, deeply nested or not.

To demonstrate with example data of StackOverflow's top Python and XSLT users, see the online demo, where SDI and EDI are added to the first and last <user> nodes:

XSLT (save as a .xsl file, a special .xml file to be loaded in Python)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- IDENTITY TRANSFORM -->    
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- EDIT FIRST W NODE -->    
  <xsl:template match="W[count(preceding::W)=0]">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:value-of select="concat('SDI ', text())"/>
    </xsl:copy>
  </xsl:template>

  <!-- EDIT LAST W NODE -->    
  <xsl:template match="W[count(preceding::W)+1 = count(//W)]">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:value-of select="concat(text(), ' EDI')"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Python (no loops or if/else logic)

import lxml.etree as et

doc = et.parse('/path/to/Input.xml')
xsl = et.parse('/path/to/Script.xsl')

# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)    

# TRANSFORM SOURCE DOC
result = transform(doc)

# OUTPUT TO CONSOLE
print(result)

# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)

Something like this (a.xml is the XML you have uploaded):

Note that the code does not use any external library.

import xml.etree.ElementTree as ET

SD = 'SD'
ED = 'ED'

root = ET.parse('a.xml')

counter = 1

for ne in root.findall('.//ne'):
    w_lst = ne.findall('.//W')
    if w_lst:
        w_lst[0].text = '{}{} {}'.format(SD, counter, w_lst[0].text)
        if len(w_lst) > 1:
            w_lst[-1].text = '{} {}{}'.format(w_lst[-1].text, ED, counter)
        counter += 1
ET.dump(root)
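
Two edge cases in the snippet above may matter for a real corpus: an `ne` with a single `W` gets `SD` but no `ED`, and a `W` without text is formatted as the literal string `None`. The variant below is a hedged sketch that treats a lone `W` as both first and last and treats missing text as empty. It numbers `ne` elements in document order, as the snippet above does. The function name `mark_boundaries` and the sample XML are assumptions, not part of the original answer.

```python
import xml.etree.ElementTree as ET

def mark_boundaries(root, start="SD", end="ED"):
    """Prefix the first W of each ne with SDi and suffix the last W with EDi."""
    counter = 1
    for ne in root.iter("ne"):
        words = list(ne.iter("W"))
        if not words:
            continue  # ne without any W: nothing to mark, number not consumed
        first, last = words[0], words[-1]
        first.text = "{}{} {}".format(start, counter, first.text or "")
        # When first is last, this appends to the already-prefixed text.
        last.text = "{} {}{}".format(last.text or "", end, counter)
        counter += 1
    return counter - 1  # number of ne elements that were marked

# Made-up sample; the real corpus has more attributes and deeper nesting.
SAMPLE = "<doc><ne><np><W>a</W><W>b</W></np></ne><ne/><ne><W>c</W></ne></doc>"
root = ET.fromstring(SAMPLE)
marked = mark_boundaries(root)
print(marked, [w.text for w in root.iter("W")])
print(ET.tostring(root, encoding="unicode"))
```

For the real file, `ET.parse('a.xml').getroot()` would take the place of `ET.fromstring(SAMPLE)`. Note that with nested `ne` elements, the outer one also sees the `W` elements of the inner ones, so a `W` can receive more than one marker.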
