How do I get text from an hr tag using BeautifulSoup?

Question

This is an example of the HTML (I've tried to make it a lot neater than what it actually looks like):

<P>
random text
<br>
<br>
<i>Anonymous</i> 
<span style="font-size: 10px; margin-left: 10px; color: #994;">Nov 30 12:46pm</span> 
<span style="font-size: 10px; margin-left: 20px;">
   <a style="color: #888; text-decoration: none;" title="Flag as offensive post"      
       href="/flag?a=248830&r=1">FLAG
   </a>
</span>

<hr> **THIS IS THE TEXT I NEED** 
<br>
<br>
<i>Anonymous</i> 
<span style="font-size: 10px; margin-left: 10px; color: #994;">Nov 30 3:40pm</span>   
<span style="font-size: 10px; margin-left: 20px;">
    <a style="color: #888; text-decoration: none;" title="Flag as offensive post" 
       href="/flag?a=248830&r=2">FLAG
    </a>
</span>

<hr>**THIS IS THE TEXT I NEED**
<br>
<br>

<script type="text/javascript">

<script type="text/javascript" src="//cdn.chitika.net/getads.js" async></script>

**THIS IS THE TEXT I NEED** 
<br>
<br>
<i>Anonymous</i>

I'm trying to get the text from the hr tag. However, doing

for i in soup.find_all('hr'):
    print(i.text)

does not work. Instead, I get blank output.

I've also tried

soup.find('i').previousSibling

but that outputs a blank too. I'm not sure if that's because of the <br> <br> before it.

How can I get the **THIS IS THE TEXT I NEED** ?

Answer 1

The text you need isn't in an <hr>; it's in the <p>. An <hr> is a void element with no content of its own, so its text is always empty. You can get the paragraph text like this:

from bs4 import BeautifulSoup

# doc is assumed to hold the HTML source as a string
soup = BeautifulSoup(doc, "html.parser")
ps = soup.findAll("p")
print(ps[0].getText())

This prints the text of the entire paragraph:

random text


Anonymous
Nov 30 12:46pm

FLAG
   

 **THIS IS THE TEXT I NEED** 


Anonymous
Nov 30 3:40pm

FLAG
    

**THIS IS THE TEXT I NEED**




**THIS IS THE TEXT I NEED** 


Anonymous



You'll need to pick out the text you want with something like:

import re

rawText = ps[0].getText()
matches = re.findall(r'\*\*.*\*\*',rawText)
for m in matches:
    print(m)

Which prints out:

**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**

But you'll need to fish out your text some other way, because I doubt it is really surrounded by asterisks. Edit: As a side note, you can use soup.find instead of soup.findAll, but I don't think that really matters.
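One way to avoid depending on the asterisks is to read the text node that directly follows each <hr>. Since <hr> has no content, the wanted text is its next sibling rather than its child. The following is only a sketch: the sample markup and the helper name text_after are made up for illustration, and the html.parser backend is assumed.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up sample that mimics the structure in the question.
html = """<p>
first post
<br><br>
<i>Anonymous</i>
<hr> second post
<br><br>
<i>Anonymous</i>
<hr>third post
<br><br>
</p>"""

soup = BeautifulSoup(html, "html.parser")


def text_after(tag):
    """Return the stripped text node right after `tag`, or '' if there is none."""
    sibling = tag.next_sibling
    if isinstance(sibling, NavigableString):
        return sibling.strip()
    return ""


# <hr> itself is empty, so look at what comes after it.
posts = [text_after(hr) for hr in soup.find_all("hr")]
print(posts)  # ['second post', 'third post']
```

This only finds text that sits right after an <hr>; a post that follows some other tag would need a different anchor.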

Answer 2

You could try just accessing the next element:

for hr in soup.find_all('hr'):
    print(hr.next_element.get_text(strip=True))

For your HTML this displays:

**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**
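
Only two of the three strings show up in that output, because the last one in the sample follows a <script> rather than an <hr>. If every post is a direct text child of the <p>, as it appears to be in the question's markup, one alternative is to keep the non-blank strings among the paragraph's direct children. This is a sketch under that assumption, with made-up sample HTML; text nested inside <i>, <span>, <a>, or <script> is skipped because it is not a direct child.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up sample: one post follows a <script>, not an <hr>.
html = """<p>
first post
<br><br>
<i>Anonymous</i>
<span>Nov 30 12:46pm</span>
<hr> second post
<br><br>
<script>var x = 1;</script>
third post
<br><br>
</p>"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# Keep only direct string children of <p> that contain non-whitespace text.
posts = [
    child.strip()
    for child in p.children
    if isinstance(child, NavigableString) and child.strip()
]
print(posts)  # ['first post', 'second post', 'third post']
```

Unlike the <hr>-based loop, this also returns the first post, so it may need filtering if only the replies are wanted.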

2 solutions

Solution 1
1 2022-02-25 02:56:25

Solution 2
1 2022-02-25 09:58:20
