Extract text after an <hr> tag in BeautifulSoup

Question

I have a script that extracts data from a page. I can scrape most of it, but there is some text after an "hr" tag that I'm not sure how to scrape. The HTML is as follows:

<div itemprop="articleBody" class="article-body">
            <p itemprop="immediateRelease" class="immediateRelease">IMMEDIATE RELEASE</p>
            <h1 itemprop="headline">HEADLINE</h1>
            <div class="hidden-lg meta">
                <ul>
                    <li><time pubdate="" datetime="Jan. 23, 2019">Jan. 23, 2019</time></li>
                    <li>News Release</li>

                    <li>Release No: NR-014-19</li>

                </ul>
            </div>

                <hr>

Text Text Text <br>
<br>
Text Text Text <br>
<br>
Text Text Text.<br>
<br>
Text Text Text  <a href="mailto: Text Text Text " class="ApplyClass"> Text Text Text </a>.<br>
<p>&nbsp;</p>
<p>E Text Text Text </p>

            </div>

How do I extract the text after the hr tag up to the end of the div tag? For the other elements I used something like:

    for meta in soup.find_all('div', class_='hidden-lg meta'):
        data = meta.text.splitlines()

        d['date'] = data[2]
        d['type'] = data[3]
        d['release'] = data[4]

Answer 1

It's a little tricky and feels like a workaround, but you can walk the next_sibling attribute of the bs4 element and check the type of each node:

import bs4

html = """<div itemprop="articleBody" class="article-body">
            <p itemprop="immediateRelease" class="immediateRelease">IMMEDIATE RELEASE</p>
            <h1 itemprop="headline">HEADLINE</h1>
            <div class="hidden-lg meta">
                <ul>
                    <li><time pubdate="" datetime="Jan. 23, 2019">Jan. 23, 2019</time></li>
                    <li>News Release</li>

                    <li>Release No: NR-014-19</li>

                </ul>
            </div>

                <hr>

Text Text Text <br>
<br>
Text Text Text <br>
<br>
Text Text Text.<br>
<br>
Text Text Text  <a href="mailto: Text Text Text " class="ApplyClass"> Text Text Text </a>.<br>
<p>&nbsp;</p>
<p>E Text Text Text </p>

            </div>"""

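soup = bs4.BeautifulSoup(html, 'html.parser')
div = soup.find('div')
hr = div.find('hr')

# Walk every node that follows the <hr> inside the same parent div.
text = ''
el = hr.next_sibling
while el is not None:
    if isinstance(el, bs4.element.Tag):
        # Tags such as <br>, <a> and <p> contribute their inner text.
        text += el.get_text()
    elif isinstance(el, bs4.element.NavigableString):
        # Bare text nodes sitting between the tags.
        text += el
    el = el.next_sibling

print(text)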

Output:

Text Text Text 

Text Text Text 

Text Text Text.

Text Text Text   Text Text Text .
 
E Text Text Text
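
The same idea can also be written with the next_siblings generator, which yields every following sibling without tracking the current node by hand. This is only a sketch: the helper name text_after and the short sample markup are made up for illustration and are not part of the original page.

```python
import bs4


def text_after(tag):
    """Concatenate the text of every sibling that follows `tag`.

    `text_after` is a hypothetical helper name used for this sketch.
    """
    parts = []
    for sibling in tag.next_siblings:
        if isinstance(sibling, bs4.element.Tag):
            # Nested tags contribute their inner text.
            parts.append(sibling.get_text())
        elif isinstance(sibling, bs4.element.NavigableString):
            # Bare text nodes are added as plain strings.
            parts.append(str(sibling))
    return "".join(parts)


# Made-up sample markup, much shorter than the page in the question.
sample = (
    "<div><p>Header</p><hr>First line<br>"
    "Second <a href='#'>link</a>.<p>Last</p></div>"
)
soup = bs4.BeautifulSoup(sample, "html.parser")
result = text_after(soup.find("hr"))
print(result)
```

Collecting the pieces in a list and joining once avoids repeated string concatenation. If the line breaks matter, the <br> tags could be mapped to "\n" inside the Tag branch.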
