
BeautifulSoup: extract text after a tag

I need to scrape the text that appears before and after a script tag. HTML:

<div class="card-body">
    <div class="d-flex flex-row flex-wrap signal-row">
        <div class="signal-title">EUR/USD signal</div>
        <div class="ms-auto signal-value signal-color xh-highlight">
            <span class="timeago fw-normal small" datetime="1656687480000" timeago-id="10">1 day ago</span>
        </div>
    </div>
    <div class="d-flex flex-row flex-wrap signal-row">
        <div class="signal-title">
            From 
        </div>
        <div class="ms-auto signal-value signal-color xh-highlight">
            <span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
            <script class="">w(hhmm(1656687480));</script>20:28
        </div>
    </div>
    <div class="d-flex flex-row flex-wrap signal-row">
        <div class="signal-title">
            Till 
        </div>
        <div class="ms-auto signal-value signal-color xh-highlight">
            <span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
            <script class="">w(hhmm(1656698280));</script>23:28
        </div>
    </div>
    <div class="signal-row signal-status signal-color">
        Filled 
    </div>
    <div class="d-flex flex-row flex-wrap signal-row">
        <div class="signal-title">
            Sold at 
        </div>
        <div class="ms-auto signal-value signal-color user-select-all">
            <script>f('OCKGMP');</script>1.0407
        </div>
    </div>
    <div class="d-flex flex-row flex-wrap signal-row">
        <div class="signal-title">
            Bought at 
        </div>
        <div class="ms-auto signal-value signal-color user-select-all">
            <script>f('OCKGML');</script>1.0408
        </div>

I need to extract UTC, +05:30 and the other values that surround the script tags, e.g. in this span: <span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>. I tried next_sibling, but it returns nothing. I also tried lxml etree with XPath, and that returns nothing either.

I tried using lxml etree:

dom = etree.HTML(str(soup))
t = dom.xpath("//div[@class='ms-auto signal-value signal-color']/span/script/following-sibling::text()")
for i in t:
     print(i.text)

Using next_siblings:

l = soup.find('script').next_siblings

Expected output: UTC +05:30 20:28

Simply use .text or call the get_text() method on your element; the contents of the script tag will be ignored.

soup.select_one('.card-body span').parent.get_text(' ', strip=True)
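To see why this works, here is a minimal sketch on the single span from the question. It assumes a recent Beautiful Soup 4 release and the built-in 'html.parser'; the variable names are only illustrative. It also shows that the strings on either side of the script tag can be reached by sibling navigation, and that next_siblings is a generator that has to be iterated before it shows anything.

```python
from bs4 import BeautifulSoup

# Fragment copied from the question.
html = '<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>'
soup = BeautifulSoup(html, 'html.parser')
span = soup.select_one('span')

# get_text() joins the span's text nodes and skips the <script> contents.
text = span.get_text(' ', strip=True)
print(text)  # UTC +05:30

# The neighbours of the <script> tag are plain strings.
script = span.find('script')
before = script.previous_sibling
after = script.next_sibling
print(before, after)  # UTC +05:30

# next_siblings is a generator; wrap it in list() to look at its items.
following = list(script.next_siblings)
print(following)  # ['+05:30']
```

If this prints the expected values on the static snippet but not on the live page, the HTML your scraper receives probably differs from what the browser shows.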

Note: I assume the HTML is generated dynamically, so what your scraper actually receives may differ from the HTML shown in your question.

Example

This selects every <span> inside .card-body, iterates over the ResultSet and prints the text of each span's parent element.

from bs4 import BeautifulSoup

html='''
<div class="card-body">
    <div>
        <div class="ms-auto signal-value signal-color xh-highlight">
            <span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
            <script class="">w(hhmm(1656687480));</script>20:28
        </div>
    </div>
    <div class="d-flex flex-row flex-wrap signal-row">
        <div class="signal-title">
            Till 
        </div>
        <div class="ms-auto signal-value signal-color xh-highlight">
            <span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
            <script class="">w(hhmm(1656698280));</script>23:28
        </div>
    </div>
'''
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning

for e in soup.select('.card-body span'):
    print(e.parent.get_text(' ', strip=True))

Output:

UTC +05:30 20:28
UTC +05:30 23:28
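
For the lxml route attempted in the question, a possible sketch follows. It rests on two observations about the HTML as posted, not on anything confirmed for the live page: the exact comparison @class='ms-auto signal-value signal-color' does not match a div whose class attribute also contains xh-highlight, and text() results in lxml are already strings, so they have no .text attribute. The shortened HTML fragment and the variable names are only illustrative.

```python
from lxml import etree

# Shortened fragment based on the HTML in the question.
html = '''
<div class="ms-auto signal-value signal-color xh-highlight">
    <span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
    <script class="">w(hhmm(1656687480));</script>20:28
</div>
'''
dom = etree.HTML(html)

# contains() tolerates extra classes such as xh-highlight; an exact
# @class='...' comparison would not match this div.
nodes = dom.xpath(
    "//div[contains(@class, 'signal-value')]//script/following-sibling::text()"
)

# text() results are strings, so strip them directly instead of using .text.
parts = [n.strip() for n in nodes if n.strip()]
print(parts)  # ['+05:30', '20:28']
```

This only returns the text after each script tag; the leading UTC would need a separate expression such as //span/text().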
