How do I extract text from between HTML tags?

Question

I have some HTML elements that I want to extract text from. The HTML looks like this:

<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>

</pre>

I want to extract the text as:

ZeroDivisionErrorTraceback (most recent call last)
<ipython-input-2-0f9f90da76dc> in <module>()

I found an answer to this problem here, but it does not work for me. Complete example code:

from bs4 import BeautifulSoup as BSHTML

bs = BSHTML("""<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>
</pre>""")
print bs.font.contents[0].strip()

I get the following error:

Traceback (most recent call last):
  File "invest.py", line 13, in <module>
    print bs.font.contents[0].strip()
AttributeError: 'NoneType' object has no attribute 'contents'

Am I missing something? BeautifulSoup version: 4.6.0

Answer 1

Do you want all of the text content of the pre block?

print(bs.pre.text)  # parenthesised form works on both Python 2 and Python 3

Returns:

ZeroDivisionErrorTraceback (most recent call last)
<ipython-input-2-0f9f90da76dc> in <module>()
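
A slightly fuller sketch of the same idea, assuming Python 3 and that "html.parser" is passed explicitly as the parser (the question does not name one). The names `soup` and `pre_text` are illustrative only. `get_text()` is the method form of `.text`, and the final `strip()` is added here to drop the newlines around the block.

```python
from bs4 import BeautifulSoup

html = """<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>
</pre>"""

# Naming the parser keeps the result the same across machines
# (assumption: the stdlib-backed "html.parser" is good enough here).
soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates every text node under <pre>;
# entities such as &lt; are already decoded to "<".
pre_text = soup.pre.get_text().strip()
print(pre_text)
```

Whether the newline directly after the opening tag is kept can depend on the parser, which is why the sketch strips the result rather than relying on it.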

Answer 2

The .font in your code example refers to the HTML tag <font>. Since you are looking for all of the text in the document, you can use the following:

contents = bs.find_all(text=True)
for c in contents:
    print(c)  # replace this with whatever you're trying to do

Output:

ZeroDivisionError
Traceback (most recent call last)

<ipython-input-2-0f9f90da76dc>
 in
<module>
()

Currently bs.font is None because the document you are parsing does not contain any <font> tags.
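
To make the None case concrete, here is a minimal sketch, assuming Python 3 and a shortened version of the HTML from the question. Attribute access such as `soup.font` behaves like `soup.find("font")`, so both give None when the tag is absent. The names `font`, `first_span` and `span_text` are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened sample: it contains <span> tags but no <font> tag.
html = ('<pre><span class="ansi-red-fg">ZeroDivisionError</span>'
        'Traceback (most recent call last)</pre>')
soup = BeautifulSoup(html, "html.parser")

# Looking up a tag that does not exist yields None instead of raising,
# so the AttributeError only appears on the next attribute access.
font = soup.find("font")
if font is None:
    print("no <font> tag in this document")
else:
    print(font.get_text().strip())

# The same guard works for tags that do exist.
first_span = soup.find("span")
span_text = first_span.get_text() if first_span is not None else ""
print(span_text)
```

Checking for None before touching `.contents` or `.text` avoids the AttributeError shown in the question.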

If you just want the content as one long string, simply use bs.text:

'\nZeroDivisionErrorTraceback (most recent call last)\n<ipython-input-2-0f9f90da76dc> in <module>()\n'
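
If installing BeautifulSoup is not an option, a similar single string can be built with the standard library alone. This is a swapped-in alternative, not what either answer uses. It is a rough sketch assuming Python 3, and the class name `TextExtractor` is hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects every text node and ignores the tags around it."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; before
        # the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


html = """<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>
</pre>"""

extractor = TextExtractor()
extractor.feed(html)
extractor.close()  # flush any text that is still buffered

full_text = extractor.text()
print(repr(full_text))
```

This is less forgiving of malformed markup than BeautifulSoup, so it is best kept for small, well-formed fragments like the one above.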
