Extracting text from mangled HTML tag with <br> separating the elements
So I have this HTML piece:
<p class="tbtx">
MWF
<br></br>
TH
</p>
which seems to be completely mangled. I need to extract the data, i.e. ['MWF', 'TH'].
The only solution I could think of is to replace all newlines and spaces in the HTML, split it at the <br> tag, rebuild the HTML structure, and then extract .text, but that's a bit ridiculous.
Are there any proper solutions for this?
.stripped_strings is what you are looking for: it yields each string in the document with the surrounding whitespace removed, skipping strings that consist only of whitespace.
Demo:
from bs4 import BeautifulSoup
data = """<p class="tbtx">
MWF
<br></br>
TH
</p>"""
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
print(list(soup.stripped_strings))  # prints ['MWF', 'TH']
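If the page contains more than this one paragraph, the same property can be read from a single tag instead of the whole soup. A minimal sketch, assuming the paragraph can be located by its tbtx class; the surrounding <div> and <h1> are made up for illustration, and "html.parser" is just one possible parser choice:

```python
from bs4 import BeautifulSoup

# Hypothetical page: the target paragraph plus unrelated text around it.
data = """<div>
<h1>Schedule</h1>
<p class="tbtx">
MWF
<br></br>
TH
</p>
</div>"""

soup = BeautifulSoup(data, "html.parser")

# Locate only the paragraph of interest, then collect its stripped strings.
paragraph = soup.find("p", class_="tbtx")
days = list(paragraph.stripped_strings)
print(days)
```

Scoping the lookup to the tag keeps text from the rest of the page, such as the heading here, out of the result.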
You can do this using filter and BeautifulSoup to pull out just the text from your HTML snippet.
from bs4 import BeautifulSoup
html = """<p class="tbtx">
MWF
<br></br>
TH
</p>"""
# filter(None, ...) drops the empty strings left where the <br> was
print(list(filter(None, BeautifulSoup(html, "html.parser").get_text().strip().split("\n"))))
Outputs:
['MWF', 'TH']
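The one-liner packs several steps together, so it may help to see them spelled out. This is only a sketch of the same idea, with a list comprehension standing in for filter; the variable names are illustrative and "html.parser" is an assumed parser choice:

```python
from bs4 import BeautifulSoup

html = """<p class="tbtx">
MWF
<br></br>
TH
</p>"""

# Step 1: flatten the markup to plain text (tags are dropped, newlines stay).
text = BeautifulSoup(html, "html.parser").get_text()

# Step 2: trim the outer whitespace and split on newlines.
lines = text.strip().split("\n")

# Step 3: keep only non-empty lines, stripping stray spaces from each.
days = [line.strip() for line in lines if line.strip()]
print(days)
```

The extra strip() on each line is a small hardening step that the original one-liner does not perform; it matters only if the lines carry indentation.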
I would recommend extracting the text using regular expressions.
For instance, if your HTML was as you noted:
"
<p class="tbtx">
MWF
<br></br>
TH
</p>
"
We can see that the desired text ("MWF", "TH") is surrounded by whitespace characters.
Thus the regular expression "\\s\\w+\\s" reads "find any run of word characters that is surrounded by whitespace characters" and would identify the desired text.
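The pattern described above can be tried in Python with the standard re module. This is a rough sketch rather than a robust parser: the capturing group around \w+ is an addition so that only the word, not the surrounding whitespace, is returned.

```python
import re

html = """
<p class="tbtx">
MWF
<br></br>
TH
</p>
"""

# \s(\w+)\s: a run of word characters with whitespace on both sides.
# The parentheses make findall return just the word itself.
days = re.findall(r"\s(\w+)\s", html)
print(days)
```

Because the pattern runs over the raw markup, any whitespace-delimited word inside a tag would match as well, and since each match consumes its trailing whitespace, words separated by a single space can be skipped. It fits this snippet but may need adjusting for other input.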
Here is a cheat sheet for creating regular expressions: http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1
And you can test your regular expression against the desired text here: http://regexpal.com/