Extracting text from malformed HTML markup with <br>-separated elements

Question

So I have this HTML snippet:

<p class="tbtx">


                              MWF



<br></br>

TH
</p>

which seems to be completely mangled. I need to extract the data, i.e. ['MWF', 'TH'].

The only solution I could think of is to replace all newlines and spaces in the HTML, split it at the <br>, rebuild the HTML structure, and then extract .text, but that's a bit ridiculous.

Is there a proper solution for this?

Answer 1

.stripped_strings is what you are looking for: it yields each string in the document with leading and trailing whitespace removed, and skips strings that consist only of whitespace.

Demo:

from bs4 import BeautifulSoup

data = """<p class="tbtx">


                              MWF



<br></br>

TH
</p>"""

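# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")
print(list(soup.stripped_strings))  # prints ['MWF', 'TH']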
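
If the snippet is one cell of a larger page, the same idea can be applied per element instead of to the whole document. The following is only a sketch: the surrounding table and the second paragraph are made up for illustration, and days_per_cell is a hypothetical helper name. It uses find_all with the class_ keyword to select each <p class="tbtx"> and collects the stripped strings of each one separately.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the table wrapper and the second cell are invented for this example
html = """<table>
<tr><td><p class="tbtx">
        MWF
<br></br>
TH
</p></td></tr>
<tr><td><p class="tbtx">  TTh  </p></td></tr>
</table>"""

def days_per_cell(soup):
    """Return one list of stripped strings per <p class="tbtx"> element."""
    return [list(p.stripped_strings) for p in soup.find_all("p", class_="tbtx")]

soup = BeautifulSoup(html, "html.parser")
rows = days_per_cell(soup)
print(rows)  # prints [['MWF', 'TH'], ['TTh']]
```

Keeping one list per element preserves which strings belonged together, which a single call on the whole soup would lose.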

Answer 2

You can do this with filter and BeautifulSoup, pulling just the text out of your HTML snippet and discarding the empty lines.

from bs4 import BeautifulSoup

html = """<p class="tbtx">


                              MWF



<br></br>

TH
</p>"""

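# get_text() returns all text in the snippet; filter(None, ...) drops the empty lines
print(list(filter(None, BeautifulSoup(html, "html.parser").get_text().strip().split("\n"))))

Outputs:

['MWF', 'TH']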
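
One caveat with this approach: filter(None, ...) only drops empty strings, so a line that contains nothing but spaces would survive, and indentation on inner lines is kept. A minimal sketch of a more tolerant variant follows. The function name non_blank_lines and the sample text are assumptions made for illustration; the text stands in for whatever get_text() returns.

```python
def non_blank_lines(text):
    """Return each line of text stripped, skipping empty and whitespace-only lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# Hypothetical get_text() result with a whitespace-only line in the middle
text = "\n\n   MWF\n   \n\nTH  \n"

robust = non_blank_lines(text)
naive = list(filter(None, text.strip().split("\n")))

print(robust)  # prints ['MWF', 'TH']
print(naive)   # prints ['MWF', '   ', 'TH']
```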

Answer 3

I would recommend extracting the text using regular expressions.

For instance, if your HTML was as you noted:

"
<p class="tbtx">


                              MWF



<br></br>

TH
</p>
"

We can see that the desired text ("MWF", "TH") is surrounded by whitespace characters.

Thus the regular expression "\\s\\w+\\s" reads "find any run of word characters that is surrounded by whitespace characters" and would identify the desired text.
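
As a rough sketch of how that pattern could be applied in Python (an assumption, since the answer does not name a language), the version below uses the standard re module with a raw string, so the backslashes are not doubled. It also adds a capture group around \w+, which is a small change from the pattern above, so that the surrounding whitespace is not part of the result. Note that this is fragile: it works on this snippet, but a multi-word attribute value such as class="a b c" would also produce a match.

```python
import re

html = """<p class="tbtx">


                              MWF



<br></br>

TH
</p>"""

# \s(\w+)\s : a run of word characters with whitespace on both sides;
# the group keeps only the word characters
tokens = re.findall(r"\s(\w+)\s", html)
print(tokens)  # prints ['MWF', 'TH']
```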

Here is a cheat sheet for creating regular expressions: http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1

And you can test your regular expression against the desired text here: http://regexpal.com/

3 answers

Answer 1
3, accepted, 2014-07-24 15:08:16

Answer 2
1, 2014-07-24 15:06:25

Answer 3
-3, 2014-07-24 15:18:17