
BeautifulSoup - remove children but keep their contents

I am creating a web scraper, and I have trouble fetching pages that are most likely generated, like this:

<html>
<body>
<div >
<code>
    <p class="nt">&lt;my-component</p> <p class="na">v-bind:prop1=</p><p class="s">"parentValue"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
    <p class="c">&lt;!-- Or more succinctly, --&gt;</p>
    <p class="nt">&lt;my-component</p> <p class="na">:prop1=</p><p class="s">"parentValue"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
</code>
</div>
<div>
<code>
    <p class="nt">&lt;my-component</p> <p class="na">v-on:myEvent=</p><p class="s">"parentHandler"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
    <p class="c">&lt;!-- Or more succinctly, --&gt;</p>
    <p class="nt">&lt;my-component</p> <p class="err">@</p><p class="na">myEvent=</p><p class="s">"parentHandler"</p><p class="nt">&gt;&lt;/my-component&gt;</p>

</code>
</div>
</body>
</html>

The most important part is the content between the code tags. The plan is to extract the text between the <p> tags (or remove those <p> tags) and keep the rest of the DOM as it is.

So I need output like this:

<html>
<body>
<div >
<code>
  text text and more text
</code>
</div>
</body>
</html>

My attempt is as follows:

from bs4 import BeautifulSoup


# payload holds the HTML document shown above
bs = BeautifulSoup(payload, 'lxml')

with open('/tmp/out.html', 'w+') as f:
    for t in bs.find_all():
        
        for q in t.find_all('code'):
            # print(t.text, t.next_sibling)
            f.write(q.text)


but this doesn't give good results. From what I have learned, the main purpose of bs is to extract elements, which is why I tried recreating the DOM in another file.

Thanks!
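One likely reason the attempt gives poor results is the nested search: `find_all()` with no arguments returns every tag, and each ancestor of a `<code>` element then finds that same `<code>` again, so its text is written several times. The sketch below illustrates this on a shortened, made-up document; it uses the built-in `html.parser` instead of `lxml` only to avoid the extra dependency, and the names `nested` and `flat` are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's page (an assumption for illustration).
html = "<html><body><div><code><p>a</p><p>b</p></code></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Nested search, as in the attempt: <html>, <body> and <div> each
# contain the <code> element, so it is collected once per ancestor.
nested = [q.text for t in soup.find_all() for q in t.find_all("code")]

# Single search: each <code> element is visited exactly once.
flat = [q.text for q in soup.find_all("code")]

print(nested)  # ['ab', 'ab', 'ab']
print(flat)    # ['ab']
```

Searching for `code` directly on the soup avoids the duplication.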

You can try this:

from bs4 import BeautifulSoup

payload='''
<html>
<body>
<div >
<code>
    <p class="nt">&lt;my-component</p> <p class="na">v-bind:prop1=</p><p class="s">"parentValue"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
    <p class="c">&lt;!-- Or more succinctly, --&gt;</p>
    <p class="nt">&lt;my-component</p> <p class="na">:prop1=</p><p class="s">"parentValue"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
</code>
</div>
<div>
<code>
    <p class="nt">&lt;my-component</p> <p class="na">v-on:myEvent=</p><p class="s">"parentHandler"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
    <p class="c">&lt;!-- Or more succinctly, --&gt;</p>
    <p class="nt">&lt;my-component</p> <p class="err">@</p><p class="na">myEvent=</p><p class="s">"parentHandler"</p><p class="nt">&gt;&lt;/my-component&gt;</p>

</code>
</div>
</body>
</html>
'''

soup = BeautifulSoup(payload, 'lxml')

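for match in soup.find_all('code'):
    # Build a fresh <code> tag holding only the text of the old one;
    # match.text concatenates the text of every nested <p>.
    new_t = soup.new_tag('code')
    new_t.string = match.text
    # Swap the old <code> (with its <p> children) for the text-only copy.
    match.replace_with(new_t)

# Write the modified document to disk.
with open(r'prove.html', "w") as file:
    file.write(str(soup))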

Output (prove.html):

<html>
<body>
<div>
<code>
&lt;my-component v-bind:prop1="parentValue"&gt;&lt;/my-component&gt;
&lt;!-- Or more succinctly, --&gt;
&lt;my-component :prop1="parentValue"&gt;&lt;/my-component&gt;
</code>
</div>
<div>
<code>
&lt;my-component v-on:myEvent="parentHandler"&gt;&lt;/my-component&gt;
&lt;!-- Or more succinctly, --&gt;
&lt;my-component @myEvent="parentHandler"&gt;&lt;/my-component&gt;
</code>
</div>
</body>
</html>
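As an alternative to rebuilding each `<code>` tag, BeautifulSoup's `Tag.unwrap()` removes a tag while leaving its contents in place, which matches "remove children but keep their contents" quite directly. The following is a minimal sketch, not the answer's method: the helper name `unwrap_children`, its parameters, and the shortened sample are all hypothetical, and `html.parser` is used here instead of `lxml` only to keep the example dependency-light.

```python
from bs4 import BeautifulSoup


def unwrap_children(html, parent="code", child="p"):
    """Hypothetical helper: remove every `child` tag found inside each
    `parent` tag, keeping the children's contents in the tree."""
    soup = BeautifulSoup(html, "html.parser")
    for container in soup.find_all(parent):
        for tag in container.find_all(child):
            tag.unwrap()  # replace the tag with its own contents
    return soup


# Shortened sample in the shape of the question's markup (illustrative only).
sample = (
    '<div><code>'
    '<p class="nt">&lt;my-component</p> '
    '<p class="na">:prop1=</p>'
    '<p class="s">"parentValue"</p>'
    '<p class="nt">&gt;&lt;/my-component&gt;</p>'
    '</code></div>'
)

soup = unwrap_children(sample)
print(soup)
```

Unlike replacing the whole `<code>` element with its text, this only touches the `<p>` children, so any other markup inside `<code>` would be left as it is.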
