
Use BeautifulSoup to Replace Every Occurrence of XML Tag with Another Tag

I am trying to replace every occurrence of an XML tag in a document (call it the target) with the contents of a tag in a different document (call it the source). The tag from the source could contain just text, or it could contain more XML.

Here is a simple example of what I am not able to get working:

test-source.htm:

<?xml version="1.0" encoding="utf-8"?>
<html>
    <head>
    </head>
    <body>
    <srctxt>text to be added</srctxt>
    </body>
</html>

test-target.htm:

<?xml version="1.0" encoding="utf-8"?>
<html>
    <head>
    </head>
    <body>
    <replacethis src="test-source.htm"></replacethis>
    <p>irrelevant, just here for filler</p>
    <replacethis src="test-source.htm"></replacethis>
    </body>
</html>

replace_example.py:

import os
import re
from bs4 import BeautifulSoup
# Just for testing

source_file = "test-source.htm"
target_file = "test-target.htm"

with open(source_file) as s:
    source = BeautifulSoup(s, "lxml")

with open(target_file) as t:
    target = BeautifulSoup(t, "lxml")

source_tag = source.srctxt

for tag in target():
    for attribute in tag.attrs:
        if re.search(source_file, str(tag[attribute])):
            tag.replace_with(source_tag)

with open(target_file, "w") as w:
    w.write(str(target))

This is my unfortunate test-target.htm after running replace_example.py:

<?xml version="1.0" encoding="utf-8"?><html>
<head>
</head>
<body>

<p>irrelevant, just here for filler</p>
<srctxt>text to be added</srctxt>
</body>
</html>

The first replacethis tag is now gone and the second replacethis tag has been replaced. This same problem happens with "insert" and "insert_before".

The output I want is:

<?xml version="1.0" encoding="utf-8"?><html>
<head>
</head>
<body>
<srctxt>text to be added</srctxt>    
<p>irrelevant, just here for filler</p>
<srctxt>text to be added</srctxt>
</body>
</html>

Can someone please point me in the right direction?


Additional Complications: The example above is the simplest case where I could reproduce the problem I seem to be having with BeautifulSoup, but it does not convey the full detail of the problem I'm trying to solve. Actually, I have a list of targets and sources. The replacethis tag needs to be replaced by the contents of a source only if the src attribute contains a reference to a source in the list. So I could use the replace method, but it would require writing a lot more regex than if I could convince BeautifulSoup to work. If this problem is a BeautifulSoup bug, then maybe I'll just have to write the regex instead.

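You could use another parser (html.parser) if you want to get rid of the extra tags that lxml adds.

BS4's replace_with behavior looks like a bug in the library, but it is not one: a tag object can occupy only one position in a tree. Your loop passes the same source_tag object to replace_with twice, so the second call moves the tag out of the first position instead of duplicating it.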

As a partial solution you can just call

target_text.replace('<replacethis src="test-source.htm"></replacethis>', source_text)
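
To stay within BeautifulSoup, one option is to insert a fresh copy of the source tag for every replacement, so that no single tag object has to be in two places. The sketch below is a minimal illustration, not a drop-in fix: it uses inline strings instead of the files from the question, the html.parser backend, and a hypothetical `sources` dictionary (not in the original) that maps each file name to the tag that should be inserted, which also covers the "list of sources" case.

```python
import copy

from bs4 import BeautifulSoup

# Inline stand-ins for test-source.htm and test-target.htm (assumption:
# the real script would read these from disk as in the question).
SOURCE = "<html><body><srctxt>text to be added</srctxt></body></html>"
TARGET = (
    "<html><body>"
    '<replacethis src="test-source.htm"></replacethis>'
    "<p>irrelevant, just here for filler</p>"
    '<replacethis src="test-source.htm"></replacethis>'
    "</body></html>"
)

target = BeautifulSoup(TARGET, "html.parser")

# Hypothetical lookup table: source file name -> tag to insert.
sources = {
    "test-source.htm": BeautifulSoup(SOURCE, "html.parser").srctxt,
}

# find_all() returns a list, so the tree is not searched while it changes.
for tag in target.find_all("replacethis", src=True):
    replacement = sources.get(tag["src"])
    if replacement is not None:
        # copy.copy() returns a detached duplicate of the tag and its
        # contents, so the original source tag is never moved.
        tag.replace_with(copy.copy(replacement))

result = str(target)
print(result)
```

Tags whose src is not a key in `sources` are left untouched. The lookup here is an exact match on the attribute value; if src only contains the file name somewhere inside a longer string, the matching would need to be loosened.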

First, using regex on [X]HTML documents is strongly discouraged. Since you are modifying XML content, consider an lxml solution; you already have lxml installed, since it is the parsing engine in your BeautifulSoup calls. This approach needs no for or if logic.

Specifically, consider XSLT, a special-purpose language designed to transform XML into other XML, HTML, or even JSON/CSV/text files. XSLT provides the document() function, which lets you read nodes from other documents. Python's lxml can run XSLT 1.0 scripts.

XSLT (save as XSLTScript.xsl in the same folder as the source file; adjust the 'replacethis' and 'srctxt' names as needed)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- UPDATE <replacethis> TAG WITH <srctxt> FROM SOURCE -->
  <xsl:template match="replacethis">
    <xsl:copy-of select="document('test-source.htm')/html/body/srctxt"/>
  </xsl:template>

</xsl:stylesheet>

Python

import lxml.etree as et

# LOAD XML AND XSL SOURCES
doc = et.parse('test-target.htm')
xsl = et.parse('XSLTScript.xsl')

# TRANSFORM SOURCE
transform = et.XSLT(xsl)
result = transform(doc)

# OUTPUT TO SCREEN
print(result)    

# OUTPUT TO FILE
with open('test-target.htm', 'wb') as f:
    f.write(result)

Output

<?xml version="1.0"?>
<html>
  <head/>
  <body>
    <srctxt>text to be added</srctxt>
    <p>irrelevant, just here for filler</p>
    <srctxt>text to be added</srctxt>
  </body>
</html>
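
The stylesheet above hard-codes 'test-source.htm'. For the case where several source files are involved, a possible variant is to pass each tag's own src attribute to document(). The sketch below is an illustration under assumptions: it writes throwaway copies of the two files into a temporary folder, keeps the stylesheet inline rather than in a .xsl file, and assumes that every src value is a plain relative file name that resolves against the folder of the target document.

```python
import tempfile
from pathlib import Path

import lxml.etree as et

SOURCE = "<html><head/><body><srctxt>text to be added</srctxt></body></html>"
TARGET = """\
<html>
  <head/>
  <body>
    <replacethis src="test-source.htm"></replacethis>
    <p>irrelevant, just here for filler</p>
    <replacethis src="test-source.htm"></replacethis>
  </body>
</html>
"""

# Same identity transform as above, but the file name is taken from @src.
XSLT = """\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="replacethis[@src]">
    <xsl:copy-of select="document(@src)/html/body/srctxt"/>
  </xsl:template>
</xsl:stylesheet>
"""

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "test-source.htm").write_text(SOURCE, encoding="utf-8")
    (folder / "test-target.htm").write_text(TARGET, encoding="utf-8")

    # Parsing from a path gives the document a base location, which
    # document(@src) uses to resolve the relative file name.
    doc = et.parse(str(folder / "test-target.htm"))
    transform = et.XSLT(et.XML(XSLT))
    output = str(transform(doc))

print(output)
```

Because the argument to document() is a node rather than a string literal, the file name is resolved relative to the target document instead of the stylesheet. Restricting the replacement to an approved list of sources would need an extra predicate on the match pattern.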
