BeautifulSoup extract text after tag
I need to scrape the text that appears before and after a script tag. HTML:
<div class="card-body">
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">EUR/USD signal</div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="timeago fw-normal small" datetime="1656687480000" timeago-id="10">1 day ago</span>
</div>
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
From
</div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656687480));</script>20:28
</div>
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
Till
</div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656698280));</script>23:28
</div>
</div>
<div class="signal-row signal-status signal-color">
Filled
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
Sold at
</div>
<div class="ms-auto signal-value signal-color user-select-all">
<script>f('OCKGMP');</script>1.0407
</div>
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
Bought at
</div>
<div class="ms-auto signal-value signal-color user-select-all">
<script>f('OCKGML');</script>1.0408
</div>
I need to extract UTC, +05:30, and the other details that appear around the script tags inside the span, e.g.: <span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
I tried using next_sibling, but it returns nothing. I also tried etree with XPath, but that returns nothing either.
I tried using lxml etree:
dom = etree.HTML(str(soup))
t = dom.xpath("//div[@class='ms-auto signal-value signal-color']/span/script/following-sibling::text()")
for i in t:
    print(i.text)
Using next_sibling:
l = soup.find('script').next_siblings
Expected output: UTC +05:30 20:28
Simply call .text or the get_text() method on your element; the contents of the script tag will be ignored.
soup.select_one('.card-body span').parent.get_text(' ', strip=True)
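If you would rather not rely on get_text() skipping script contents (as far as I know this depends on the BeautifulSoup version installed), one option is to remove the script tags before reading the text. The following is a minimal sketch of that idea, assuming a shortened one-line copy of the HTML from the question and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Shortened copy of one signal row from the question (assumption: same structure)
html = (
    '<div class="ms-auto signal-value signal-color xh-highlight">'
    '<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>'
    '<script class="">w(hhmm(1656687480));</script>20:28'
    '</div>'
)

soup = BeautifulSoup(html, 'html.parser')

# Remove every <script> tag (and its contents) from the tree
for script in soup.find_all('script'):
    script.decompose()

# Join the remaining text nodes with a single space
text = soup.select_one('span').parent.get_text(' ', strip=True)
print(text)
```

With the scripts removed, only the visible text nodes remain, so the result no longer depends on how get_text() treats script contents.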
Note: I am assuming the HTML is generated dynamically, so the markup you actually receive may differ from what is shown in your question.
The example below selects every <span>, iterates over the ResultSet, and prints the text of each span's parent; its output follows the code.
from bs4 import BeautifulSoup
html='''
<div class="card-body">
<div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656687480));</script>20:28
</div>
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
Till
</div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656698280));</script>23:28
</div>
</div>
'''
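# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
for e in soup.select('.card-body span'):
    # The parent <div> also holds the time that follows the second script
    print(e.parent.get_text(' ', strip=True))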
UTC +05:30 20:28
UTC +05:30 23:28
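As for the next_sibling attempt in the question: .next_siblings (plural) returns a generator rather than a string, so it has to be iterated, whereas .next_sibling (singular) returns the single node that follows the tag. A rough sketch of reading the text node right after each script tag, assuming the same markup as above and the html.parser backend, could look like this:

```python
from bs4 import BeautifulSoup

html = '''
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656687480));</script>20:28
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

after_scripts = []
for script in soup.find_all('script'):
    sibling = script.next_sibling  # one node, not the .next_siblings generator
    # Keep only non-empty text nodes; a following tag is not a str
    if isinstance(sibling, str) and sibling.strip():
        after_scripts.append(sibling.strip())

print(after_scripts)
```

This only collects the text after each script; the text before one (such as UTC) is available through script.previous_sibling in the same way.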