BeautifulSoup not parsing HTML correctly
So I have the following code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = '</p></td></tr><tr><td colspan="3"> Data I want </td></tr><tr> <td colspan="3"> Data I want </td> </tr> <tr><td colspan="3"> Data I want </td> </tr></table>'
soup = BeautifulSoup(html, "lxml")
print(soup.get_text())
But the output is empty, yet with other HTML samples it works just fine. The HTML looks like this because it is extracted from a table.
html = '<p>Content</p></td></table>'
That works just fine, for example. Any help?
Edit: I know the HTML is not valid, but the second HTML sample is also invalid, yet it works.
It's because lxml is having trouble parsing this invalid HTML. Use html.parser instead of lxml.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = '</p></td></tr><tr><td colspan="3"> Data I want </td></tr><tr> <td colspan="3"> Data I want </td> </tr> <tr><td colspan="3"> Data I want </td> </tr></table>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.get_text())
Output:
Data I want Data I want Data I want
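BeautifulSoup's 'html.parser' option is built on Python's standard-library HTMLParser, which treats a closing tag with no matching opener as just another event rather than an error. The following is a minimal Python 3 sketch using the standard library alone, without BeautifulSoup, to illustrate why the text survives. The names TextCollector and extract_text are hypothetical helpers made up for this illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(fragment):
    """Feed a (possibly invalid) HTML fragment and return its text."""
    parser = TextCollector()
    parser.feed(fragment)
    parser.close()  # flush any buffered text
    return " ".join(parser.chunks)


fragment = '</p></td></tr><tr><td colspan="3"> Data I want </td></tr><tr> <td colspan="3"> Data I want </td> </tr> <tr><td colspan="3"> Data I want </td> </tr></table>'
print(extract_text(fragment))
```

The stray </p></td></tr> at the start only triggers end-tag callbacks, which this sketch does not override, so nothing is discarded.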
If the consistent issue is a missing opening tag, you can use a regular expression to work out what it should be, like below:
from bs4 import BeautifulSoup
import re
html = '</p></td></tr><tr><td colspan="3"> Data I want </td></tr><tr> <td colspan="3"> Data I want </td> </tr> <tr><td colspan="3"> Data I want </td> </tr></table>'
pat = re.compile('</[a-z]*>')
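# Collect every closing tag, in document order.
L = pat.findall(html)
# If the first and last closing tags differ, assume the opening tag for the
# last one is missing and prepend it (e.g. '</table>' becomes '<table>').
if L and L[0] != L[-1]:
    html = L[-1].replace('/', '') + html
soup = BeautifulSoup(html, "lxml")
print(soup.get_text())
Output: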
Data I want Data I want Data I want
What you have there is not valid HTML. Why don't you change it to the following?
html = '<table><tr><td colspan="3"> Data I want </td></tr><tr> <td colspan="3"> Data I want </td> </tr> <tr><td colspan="3"> Data I want </td> </tr></table>'
But there is probably something missing before the sample you posted. Where does the HTML code come from?
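If the fragments cannot be fixed at the source, the change suggested above could be applied in code. This is only a rough Python sketch, assuming every fragment is the tail end of a table and begins with stray closing tags; repair_table_fragment and LEADING_CLOSERS are hypothetical names, not part of BeautifulSoup.

```python
import re

# One or more closing tags (optionally separated by whitespace) at the very start.
LEADING_CLOSERS = re.compile(r'^(?:\s*</[A-Za-z][A-Za-z0-9]*\s*>)+')


def repair_table_fragment(fragment):
    """Hypothetical helper: replace leading stray closing tags with <table>.

    Assumes the fragment is the tail of a table, as in the question.
    """
    body = LEADING_CLOSERS.sub('', fragment)
    return '<table>' + body


broken = '</p></td></tr><tr><td colspan="3"> Data I want </td></tr><tr> <td colspan="3"> Data I want </td> </tr> <tr><td colspan="3"> Data I want </td> </tr></table>'
repaired = repair_table_fragment(broken)
print(repaired)
```

The repaired string can then be passed to BeautifulSoup as usual. Extracting the whole table in the first place, rather than a fragment that starts mid-cell, would avoid the need for this kind of patching.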