Extracting text from a mangled HTML tag with <br> separating the elements

So I have this HTML snippet:

<p class="tbtx">


                              MWF



<br></br>

TH
</p>

which is completely mangled, it seems. I need to extract the data, i.e. ['MWF', 'TH'].

The only solution I could think of is to replace all newlines and spaces in the HTML, then split it at <br></br>, rebuild the HTML structure, and then extract .text, but that's a bit ridiculous.

Is there a proper solution for this?

.stripped_strings is what you are looking for - it yields the strings in the document with surrounding whitespace removed, skipping any that consist only of whitespace.

Demo:

from bs4 import BeautifulSoup

data = """<p class="tbtx">


                              MWF



<br></br>

TH
</p>"""

soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
print(list(soup.stripped_strings))  # prints ['MWF', 'TH']
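
If the real page contains more than this one paragraph, the same property can be read from a single tag instead of the whole document. A minimal sketch, assuming the paragraph can be located by its tbtx class and that the standard-library html.parser backend is acceptable (the variable names are only illustrative):

```python
from bs4 import BeautifulSoup

data = """<p class="tbtx">


                              MWF



<br></br>

TH
</p>"""

soup = BeautifulSoup(data, "html.parser")

# Locate the specific paragraph rather than scanning the whole document.
paragraph = soup.find("p", class_="tbtx")

# stripped_strings is a generator, so wrap it in list() to get a list.
days = list(paragraph.stripped_strings)
print(days)

# get_text() with strip=True applies the same stripping and joins the
# pieces with the given separator.
joined = paragraph.get_text(separator="|", strip=True)
print(joined)
```

Here days holds the individual strings, while joined is handy when a single delimited string is easier to pass along.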

You can do this using filter and BeautifulSoup to pull out just the text from your HTML snippet.

from bs4 import BeautifulSoup

html = """<p class="tbtx">


                              MWF



<br></br>

TH
</p>"""

# filter(None, ...) drops the empty strings left by the blank lines
print(list(filter(None, BeautifulSoup(html, "html.parser").get_text().strip().split("\n"))))

Outputs:

['MWF', 'TH']
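
The one-liner above relies on the outer strip() and only discards lines that are completely empty, so a line holding nothing but spaces, or text followed by trailing spaces, would come through unstripped. A hedged variant that strips every line individually (the names text and items are illustrative, and html.parser is an assumption about the parser backend):

```python
from bs4 import BeautifulSoup

html = """<p class="tbtx">


                              MWF



<br></br>

TH
</p>"""

# Pull all text out of the snippet, tags removed.
text = BeautifulSoup(html, "html.parser").get_text()

# Strip each line and keep only those that still contain something.
items = [line.strip() for line in text.splitlines() if line.strip()]
print(items)
```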

I would recommend extracting the text using regular expressions.

For instance, if your HTML is as you noted:

"
<p class="tbtx">


                              MWF



<br></br>

TH
</p>
"

We can see that the desired text ("MWF", "TH") is surrounded by whitespace characters.

Thus the regular expression "\\s\\w+\\s" (written with escaped backslashes, i.e. the pattern \s\w+\s) reads "find any run of word characters that is surrounded by whitespace characters" and would identify the desired text.
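
As a rough sketch of how that pattern could be applied in Python with the standard re module: a capture group is added here (an addition, not part of the pattern as stated) so that only the word is returned and the surrounding whitespace is left out of the result.

```python
import re

html = """<p class="tbtx">


                              MWF



<br></br>

TH
</p>"""

# \s(\w+)\s : one whitespace character, a run of word characters
# (captured), then another whitespace character.
pattern = re.compile(r"\s(\w+)\s")

# With one capture group, findall returns just the captured words.
matches = pattern.findall(html)
print(matches)
```

This works for the snippet because no word inside a tag has whitespace on both sides. An attribute such as class="a b c" would also produce a match for b, so the pattern is tied closely to this particular markup.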

Here is a cheat sheet for creating regular expressions: http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1

And you can test your regular expression against the desired text here: http://regexpal.com/
