
Replace a whole block of HTML with another block using BS4 in Python

I want to replace a whole block of HTML using bs4. I have a source HTML and a target HTML:

t_soup = BeautifulSoup(target_html, 'html.parser')
s_soup = BeautifulSoup(source_html, 'html.parser')

The first block of code is in the target:

//Block of code number 0
<div class="td-module-thumb">
    //Some html code
</div>

//Block of code number 1
<div class="td-module-thumb">
    <a href="this_is_an_url.html" rel="bookmark" class="image-wrap">
       <img width="356" height="220" class="entry-thumb" src="../my_image-356x220.jpg" >
    </a>
</div>

I want to replace block [1] in the target with block [1] from the source, that is:

//Block of code number 0
<div class="td-module-thumb">
    //Some html code
</div>

//Block of code number 1
<div class="td-module-thumb">
    <a href="another_url.html" rel="bookmark_2" class="new-class">
       <img width="356" height="220" class="exit" src="../other_image_here.jpg" >
    </a>
</div>

Both blocks share the same <div class="td-module-thumb"> wrapper.

My code to make the replacements is the following:

left_column_selector = 'div.td-module-thumb'
left_column = s_soup.select(left_column_selector)[1]

NOTE:

>>> type(s_soup.select(left_column_selector)[1])
<class 'bs4.element.Tag'>

These are the different attempts at my last line of code, the one that actually makes the replacement:

# Attempt 1
t_soup.select(left_column_selector)[1].replace_with(str(left_column))

# Attempt 2
t_soup.select(left_column_selector)[1].string.replace_with(left_column)

# Attempt 3
t_soup.select(left_column_selector)[1].string.replace_with(left_column.string)

# Attempt 4
t_soup.select(left_column_selector)[1].replace_with(left_column.string)

Everything works fine except the last line of code. As a consequence, the block in the target is not replaced by the one from the source.

I would do it wholesale, as it were - delete target, insert source:

selector = 'div.td-module-thumb'
to_graft = s_soup.select(selector)[0]
for div in t_soup.select(selector): 
    div.decompose()
t_soup.select_one('doc').insert(1, to_graft)

Edit:

Let's say your files look like this:

target = """<root> I am the target
<div class="td-module-thumb">
    don't touch me!
</div>
<div class="td-module-thumb">
    replace me!
    <a href="this_is_an_url.html" rel="bookmark" class="image-wrap">
       <img width="356" height="220" class="entry-thumb" src="../my_image-356x220.jpg" >
    </a>
</div>
</root>
"""
source = """<root><div class="td-module-thumb">
    I'm the irrelevant part of the source
</div>
<div class="td-module-thumb">
    move me to target!
    <a href="another_url.html" rel="bookmark_2" class="new-class">
       <img width="356" height="220" class="exit" src="../other_image_here.jpg" >
    </a>
</div>
</root>
"""

Then this should do it:

from bs4 import BeautifulSoup as bs

t_soup = bs(target, 'lxml')
s_soup = bs(source, 'lxml')
selector = 'div.td-module-thumb'
to_graft = s_soup.select(selector)[1]
to_remove = t_soup.select(selector)
to_remove[1].decompose()
t_soup.select_one('root').insert(2, to_graft)
t_soup

Output:

<root> I am the target
<div class="td-module-thumb">
    don't touch me!
</div><div class="td-module-thumb">
    move me to target!
    <a class="new-class" href="another_url.html" rel="bookmark_2">
<img class="exit" height="220" src="../other_image_here.jpg" width="356"/>
</a>
</div>

</root>
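
As a shorter alternative to decompose() plus insert(), replace_with() can be given the source Tag directly. The sketch below assumes simplified stand-in HTML (the target_html and source_html strings are made up for illustration) and uses html.parser so it needs no extra parser. It also hints at why the attempts in the question probably fail: replace_with(str(tag)) inserts the markup as plain text, which is escaped on output, and .string is None for a tag that has more than one child.

```python
import copy

from bs4 import BeautifulSoup

# Hypothetical, simplified stand-ins for the real target and source pages.
target_html = """<root>
<div class="td-module-thumb">keep me</div>
<div class="td-module-thumb"><a href="this_is_an_url.html" class="image-wrap">old</a></div>
</root>"""

source_html = """<root>
<div class="td-module-thumb">irrelevant</div>
<div class="td-module-thumb"><a href="another_url.html" class="new-class">new</a></div>
</root>"""

t_soup = BeautifulSoup(target_html, "html.parser")
s_soup = BeautifulSoup(source_html, "html.parser")

selector = "div.td-module-thumb"

# Copy the source block so that s_soup keeps its own copy;
# without the copy, the tag would be moved out of s_soup.
replacement = copy.copy(s_soup.select(selector)[1])

# Pass the Tag itself, not str(tag): a string would be inserted as text.
t_soup.select(selector)[1].replace_with(replacement)

result = str(t_soup)
print(result)
```

Because the replacement takes the exact position of the old block, there is no need to work out an index for insert().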

You can also do it by running JavaScript: import PyV8 and set the innerHTML of the parent container. The script below returns true if the replacement worked. Note that a bare PyV8 context does not define document or newhtml, so both would have to be made available to the context first.

import PyV8
ctx = PyV8.JSContext()
ctx.enter()

# getElementById takes the bare id, without a leading '#'
print(ctx.eval("""(function () {
    document.getElementById('parentContainer').innerHTML = newhtml;
    return document.getElementById('parentContainer').innerHTML == newhtml;
})()"""))
