How to get text from hr tag using BeautifulSoup?
This is an example of the HTML (I've tried to make it a lot neater than what it actually looks like):
<P>
random text
<br>
<br>
<i>Anonymous</i>
<span style="font-size: 10px; margin-left: 10px; color: #994;">Nov 30 12:46pm</span>
<span style="font-size: 10px; margin-left: 20px;">
<a style="color: #888; text-decoration: none;" title="Flag as offensive post"
href="/flag?a=248830&r=1">FLAG
</a>
</span>
<hr> **THIS IS THE TEXT I NEED**
<br>
<br>
<i>Anonymous</i>
<span style="font-size: 10px; margin-left: 10px; color: #994;">Nov 30 3:40pm</span>
<span style="font-size: 10px; margin-left: 20px;">
<a style="color: #888; text-decoration: none;" title="Flag as offensive post"
href="/flag?a=248830&r=2">FLAG
</a>
</span>
<hr>**THIS IS THE TEXT I NEED**
<br>
<br>
<script type="text/javascript">
<script type="text/javascript" src="//cdn.chitika.net/getads.js" async></script>
**THIS IS THE TEXT I NEED**
<br>
<br>
<i>Anonymous</i>
I'm trying to get the text from the hr tag. However, doing
for i in soup.find_all('hr'):
    print(i.text)
does not work. Instead, I get a blank output.
I've also tried
soup.find('i').previousSibling
but that outputs a blank as well. I'm not sure if that's because there's <br> <br> before it.
How can I get the **THIS IS THE TEXT I NEED**?
The text you need isn't in an <hr>, it's in a p. So you can get it like this:
from bs4 import BeautifulSoup

# doc holds the HTML source shown in the question
soup = BeautifulSoup(doc, "html.parser")
ps = soup.findAll("p")
print(ps[0].getText())
Now considering that this prints:
random text
Anonymous
Nov 30 12:46pm
FLAG
**THIS IS THE TEXT I NEED**
Anonymous
Nov 30 3:40pm
FLAG
**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**
Anonymous
Process finished with exit code 0
You'll need to parse out the text you need with something like:
import re
rawText = ps[0].getText()
# Match every run of text wrapped in double asterisks (one match per line)
matches = re.findall(r'\*\*.*\*\*', rawText)
for m in matches:
    print(m)
Which prints out:
**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**
But you'll need to fish out your text some other way, because I doubt it is really surrounded by asterisks.
Edit: As a side note, you can use soup.find instead of soup.findAll, but I don't think that really matters.
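To illustrate why printing the text of each hr gives a blank: <hr> is a void element, so it never has any content of its own, and whatever comes after it ends up as a sibling text node instead. The snippet below is only a minimal sketch; the one-line HTML string is a made-up stand-in for the real page, and it assumes the html.parser backend used above.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for the page in the question.
html = "<p>intro<hr>wanted text<br><i>Anonymous</i></p>"
soup = BeautifulSoup(html, "html.parser")

hr = soup.find("hr")
hr_text = hr.get_text()          # text *inside* the tag: always empty for <hr>
after_hr = str(hr.next_sibling)  # the text node that directly follows the tag

print(repr(hr_text))   # ''
print(repr(after_hr))  # 'wanted text'
```

Other parsers (lxml, html5lib) may arrange the surrounding p differently, so treat this as a description of html.parser's behaviour only.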
You could try just accessing the next element:
for hr in soup.find_all('hr'):
    print(hr.next_element.get_text(strip=True))
For your HTML this displays:
**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**
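Note that this only finds two of the three pieces of text, because the last one follows a script tag rather than an hr. A possible variation, sketched here under assumptions, is to check the sibling right after each tag of interest and keep it only if it is a non-empty text node. The helper name texts_after, the shortened HTML, and the ads.js file name are all hypothetical and only there for illustration.

```python
from bs4 import BeautifulSoup, NavigableString


def texts_after(soup, tag_names=("hr",)):
    """Collect non-empty text nodes that directly follow the given tags."""
    found = []
    for tag in soup.find_all(list(tag_names)):
        node = tag.next_sibling
        # Skip the tag if it is followed by another tag or by nothing at all.
        if isinstance(node, NavigableString):
            text = node.strip()
            if text:
                found.append(text)
    return found


# Shortened, hypothetical version of the HTML in the question.
html = """<p>
random text
<br>
<i>Anonymous</i>
<hr> first reply
<br>
<i>Anonymous</i>
<hr>second reply
<br>
<script type="text/javascript" src="ads.js" async></script>
third reply
<br>
</p>"""

soup = BeautifulSoup(html, "html.parser")
print(texts_after(soup))                    # ['first reply', 'second reply']
print(texts_after(soup, ("hr", "script")))  # ['first reply', 'second reply', 'third reply']
```

Whether script is the right extra tag to look after depends on the real page, so the tag list is left as a parameter.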