BeautifulSoup - remove children but keep their contents
I am creating a web scraper, and I am having trouble with pages that are most likely generated, like this:
<html>
<body>
<div >
<code>
<p class="nt">&lt;my-component</p> <p class="na">v-bind:prop1=</p><p class="s">"parentValue"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
<p class="c">&lt;!-- Or more succinctly, --&gt;</p>
<p class="nt">&lt;my-component</p> <p class="na">:prop1=</p><p class="s">"parentValue"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
</code>
</div>
<div>
<code>
<p class="nt">&lt;my-component</p> <p class="na">v-on:myEvent=</p><p class="s">"parentHandler"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
<p class="c">&lt;!-- Or more succinctly, --&gt;</p>
<p class="nt">&lt;my-component</p> <p class="err">@</p><p class="na">myEvent=</p><p class="s">"parentHandler"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
</code>
</div>
</body>
</html>
The most important part is the content between the <code> tags. The plan is to extract the text from the tags nested inside them (in other words, remove those child tags but keep their contents) and leave the rest of the DOM as it is.
So I need output like this:
<html>
<body>
<div >
<code>
text text and more text
</code>
</div>
</body>
</html>
My attempt so far:
from bs4 import BeautifulSoup
bs = BeautifulSoup(payload, 'lxml')
with open('/tmp/out.html', 'w+') as f:
    for t in bs.find_all():
        for q in t.find_all('code'):
            # print(t.text, t.next_sibling)
            f.write(q.text)
But this doesn't give good results. From what I have learned, BeautifulSoup's main purpose is to extract elements, which is why I tried recreating the DOM in another file.
Thanks!
You can try this:
from bs4 import BeautifulSoup
payload='''
<html>
<body>
<div >
<code>
<p class="nt">&lt;my-component</p> <p class="na">v-bind:prop1=</p><p class="s">"parentValue"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
<p class="c">&lt;!-- Or more succinctly, --&gt;</p>
<p class="nt">&lt;my-component</p> <p class="na">:prop1=</p><p class="s">"parentValue"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
</code>
</div>
<div>
<code>
<p class="nt">&lt;my-component</p> <p class="na">v-on:myEvent=</p><p class="s">"parentHandler"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
<p class="c">&lt;!-- Or more succinctly, --&gt;</p>
<p class="nt">&lt;my-component</p> <p class="err">@</p><p class="na">myEvent=</p><p class="s">"parentHandler"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
</code>
</div>
</body>
</html>
'''
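soup = BeautifulSoup(payload, 'lxml')
# Replace each <code> element with a fresh one that holds only its text.
for match in soup.find_all('code'):
    new_t = soup.new_tag('code')
    new_t.string = match.text
    match.replace_with(new_t)
# Write the modified document to a file.
with open('prove.html', 'w') as file:
    file.write(str(soup))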
Output (prove.html):
<html>
<body>
<div>
<code>
&lt;my-component v-bind:prop1="parentValue"&gt;&lt;/my-component&gt;
&lt;!-- Or more succinctly, --&gt;
&lt;my-component :prop1="parentValue"&gt;&lt;/my-component&gt;
</code>
</div>
<div>
<code>
&lt;my-component v-on:myEvent="parentHandler"&gt;&lt;/my-component&gt;
&lt;!-- Or more succinctly, --&gt;
&lt;my-component @myEvent="parentHandler"&gt;&lt;/my-component&gt;
</code>
</div>
</body>
</html>
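As a possible alternative, BeautifulSoup's unwrap() method removes a tag but leaves its contents in place, which matches "remove children but keep their contents" without building a replacement element. The following is only a minimal sketch: the html string is a shortened, hypothetical stand-in for the payload above, and 'html.parser' is used instead of 'lxml' purely to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample in the same shape as the payload above.
html = (
    '<div><code>'
    '<p class="nt">&lt;my-component</p> '
    '<p class="na">:prop1=</p>'
    '<p class="s">"parentValue"</p>'
    '<p class="nt">&gt;&lt;/my-component&gt;</p>'
    '</code></div>'
)

# 'html.parser' ships with Python; 'lxml' should work here as well.
soup = BeautifulSoup(html, 'html.parser')

for code in soup.find_all('code'):
    # find_all(True) returns a list of every tag nested inside <code>,
    # so it is safe to modify the tree while looping over it.
    for child in code.find_all(True):
        # unwrap() drops the tag itself and keeps its contents in the tree.
        child.unwrap()

result = str(soup)
print(result)
```

After this runs, each <code> element should contain only text nodes, and the surrounding <div> structure is left as it was. str(soup) can then be written to a file in the same way as in the answer above.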