
How to get text from hr tag using BeautifulSoup?

This is an example of the HTML (I've tried to make it a lot neater than what it actually looks like):

<P>
random text
<br>
<br>
<i>Anonymous</i> 
<span style="font-size: 10px; margin-left: 10px; color: #994;">Nov 30 12:46pm</span> 
<span style="font-size: 10px; margin-left: 20px;">
   <a style="color: #888; text-decoration: none;" title="Flag as offensive post"      
       href="/flag?a=248830&r=1">FLAG
   </a>
</span>

<hr> **THIS IS THE TEXT I NEED** 
<br>
<br>
<i>Anonymous</i> 
<span style="font-size: 10px; margin-left: 10px; color: #994;">Nov 30 3:40pm</span>   
<span style="font-size: 10px; margin-left: 20px;">
    <a style="color: #888; text-decoration: none;" title="Flag as offensive post" 
       href="/flag?a=248830&r=2">FLAG
    </a>
</span>

<hr>**THIS IS THE TEXT I NEED**
<br>
<br>

<script type="text/javascript">

<script type="text/javascript" src="//cdn.chitika.net/getads.js" async></script>

**THIS IS THE TEXT I NEED** 
<br>
<br>
<i>Anonymous</i> 

I'm trying to get the text from the hr tag. However, doing

for i in soup.find_all('hr'):
    print(i.text)

does not work. Instead, I get a blank output.

I've also tried

soup.find('i').previousSibling

but that outputs a blank; I'm not sure if that's because of the <br> <br> before it.

How can I get the **THIS IS THE TEXT I NEED**?

The text you need isn't in an <hr>; it's in the <p>. An <hr> is a void element and cannot contain anything, so the text that follows it belongs to the enclosing <p>. You can get it like this:

from bs4 import BeautifulSoup

# doc holds the page's HTML as a string
soup = BeautifulSoup(doc, "html.parser")
ps = soup.findAll("p")
print(ps[0].getText())

This prints:

random text


Anonymous
Nov 30 12:46pm

FLAG
   

 **THIS IS THE TEXT I NEED** 


Anonymous
Nov 30 3:40pm

FLAG
    

**THIS IS THE TEXT I NEED**




**THIS IS THE TEXT I NEED** 


Anonymous



You'll then need to extract the text you want from that output, with something like:

import re

rawText = ps[0].getText()
matches = re.findall(r'\*\*.*\*\*',rawText)
for m in matches:
    print(m)

This prints:

**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**

But you'll need to fish out your text some other way, because I doubt it is really surrounded by asterisks. Edit: As a side note, you can use soup.find instead of soup.findAll, but I don't think that really matters.
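
If the real text has no asterisks around it, one option is to skip the regex and pick out the strings that sit directly inside the <p>. Below is a minimal sketch, assuming the page is shaped like the sample in the question. The doc string and the reply texts are made-up stand-ins, not the real page.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed-down stand-in for the page in the question (assumed structure).
doc = """<P>
random text
<br>
<br>
<i>Anonymous</i>
<span>Nov 30 12:46pm</span>
<hr> first reply
<br>
<br>
<i>Anonymous</i>
<span>Nov 30 3:40pm</span>
<hr>second reply
<br>
<br>
<script type="text/javascript" src="//cdn.chitika.net/getads.js" async></script>

third reply
<br>
<br>
<i>Anonymous</i>
"""

soup = BeautifulSoup(doc, "html.parser")
p = soup.find("p")

# Keep only the bare strings that are direct children of <p>.
# Text inside <i>, <span> or <script> lives one level deeper and is skipped.
posts = [
    str(node).strip()
    for node in p.children
    if isinstance(node, NavigableString) and node.strip()
]

for post in posts:
    print(post)
```

Note that this also picks up the very first post (random text), since it is a bare string inside the <p> just like the others. Drop the first item if you don't want it.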

You could try just accessing the next element:

for hr in soup.find_all('hr'):
    print(hr.next_element.get_text(strip=True))

For your HTML this displays:

**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**
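
Only two lines come out because the third occurrence in the sample follows the </script> tag rather than an <hr>. Also, if an <hr> is followed directly by another tag, the next element is that tag and not a string. A slightly more defensive sketch of the same idea is below; the doc string is a made-up stand-in for the real page, and the type check is my own addition.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample: the second <hr> is followed directly by a tag.
doc = """<p>
random text
<br>
<hr> first reply
<br>
<hr><br>
<hr>second reply
<br>
"""

soup = BeautifulSoup(doc, "html.parser")

replies = []
for hr in soup.find_all("hr"):
    node = hr.next_sibling
    # Accept only a non-blank bare string; ignore tags and whitespace.
    if isinstance(node, NavigableString) and node.strip():
        replies.append(str(node).strip())

print(replies)
```

Here str() and .strip() are used on the string node, so nothing depends on get_text() being available on it.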
