How do I extract the text between HTML tags?

Question

I have some HTML elements from which I want to extract the text. The HTML looks like this:

<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>

</pre>

I want to extract the text as:

ZeroDivisionErrorTraceback (most recent call last)
<ipython-input-2-0f9f90da76dc> in <module>()

I found an answer to this issue here, but it does not work for me. Complete example code:

from bs4 import BeautifulSoup as BSHTML

bs = BSHTML("""<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>
</pre>""")
print bs.font.contents[0].strip()

which gives the following error:

Traceback (most recent call last):
  File "invest.py", line 13, in <module>
    print bs.font.contents[0].strip()
AttributeError: 'NoneType' object has no attribute 'contents'

Is there anything I am missing? BeautifulSoup version: 4.6.0

Answer 1

Do you want all the text content of that pre block?

print bs.pre.text

Returns:

ZeroDivisionErrorTraceback (most recent call last)
<ipython-input-2-0f9f90da76dc> in <module>()
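
For readers on Python 3, the same idea can be sketched as follows. This is a minimal example, not the answerer's exact code: the variable names `html`, `soup`, and `text` are illustrative, the "html.parser" backend is an assumption (the original does not name a parser), and the final strip() is added only to drop the newlines just inside the pre block.

```python
from bs4 import BeautifulSoup

# The HTML from the question, stored in an illustrative variable.
html = """<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>
</pre>"""

# Assumption: the stdlib-backed "html.parser" is good enough for this snippet.
soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates every string inside <pre>, including nested spans.
# Entities such as &lt; and &gt; are decoded to < and >.
text = soup.pre.get_text().strip()
print(text)
```

Here `soup.pre` selects the first pre tag in the document, which is why this works where `bs.font` did not.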

Answer 2

The .font in your code sample refers to the HTML tag <font>. Since you are instead looking for all the text in your document, you can use something like this:

contents = bs.find_all(text=True)
for c in contents:
    print(c)  # replace this with whatever you're trying to do

Output:

ZeroDivisionError
Traceback (most recent call last)

<ipython-input-2-0f9f90da76dc>
 in
<module>
()

Currently bs.font is None because you are parsing a document that doesn't contain any <font> tags.
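
The None result is easy to reproduce and guard against. The sketch below is illustrative only: it uses a shortened version of the question's HTML, assumes the "html.parser" backend, and the names `font`, `span`, `font_text`, and `span_text` are made up for the example.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's markup (illustrative).
soup = BeautifulSoup(
    '<pre><span class="ansi-red-fg">ZeroDivisionError</span></pre>',
    "html.parser",  # assumption: the original does not name a parser
)

# Attribute access returns the first matching tag, or None if there is none.
font = soup.font                                 # no <font> tag, so None
span = soup.find("span", class_="ansi-red-fg")   # this tag does exist

# Check for None before touching .contents or calling get_text().
font_text = font.get_text() if font is not None else None
span_text = span.get_text() if span is not None else None

print(font_text)
print(span_text)
```

Checking for None first avoids the AttributeError from the question when a tag is missing.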

If you just want the contents as one long string, you can get that by just using bs.text:

'\nZeroDivisionErrorTraceback (most recent call last)\n<ipython-input-2-0f9f90da76dc> in <module>()\n'
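
If the whitespace-only fragments or the surrounding newlines get in the way, one possible variation on the two approaches above is sketched here. It assumes the "html.parser" backend and uses illustrative names (`html`, `fragments`, `lines`); `stripped_strings` and `get_text()` are standard BeautifulSoup features, but whether fragments or whole lines suit you better depends on what you do with the text afterwards.

```python
from bs4 import BeautifulSoup

html = """<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>
</pre>"""

soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption

# One entry per text node, stripped, with whitespace-only nodes skipped.
fragments = list(soup.stripped_strings)

# The whole document as one string, then split into its visible lines.
lines = soup.get_text().strip().splitlines()

for line in lines:
    print(line)
```

Note that stripping each fragment also removes the spaces around "in", so join the fragments yourself only if that spacing does not matter.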

Answer 1
2, accepted, 2018-11-22 10:55:08

Answer 2
0, 2018-11-22 10:50:02