BeautifulSoup not parsing HTML correctly

I have the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

html = '</p></td></tr><tr><td colspan="3">   Data I want  </td></tr><tr>  <td colspan="3">   Data I want  </td> </tr> <tr><td colspan="3">   Data I want  </td> </tr></table>'
soup = BeautifulSoup(html, "lxml")

print(soup.get_text())

But the output is empty, even though the same code works fine with other HTML samples. The HTML looks like this because it was extracted from a table.

html = '<p>Content</p></td></table>'

That works just fine, for example. Any help?

Edit: I know the HTML is not valid, but the second HTML sample is also invalid, yet it works.

This happens because lxml has trouble parsing this invalid HTML.

Use html.parser instead of lxml.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

html = '</p></td></tr><tr><td colspan="3">   Data I want  </td></tr><tr>  <td colspan="3">   Data I want  </td> </tr> <tr><td colspan="3">   Data I want  </td> </tr></table>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.get_text())

Output:

 Data I want      Data I want       Data I want   
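To see how the choice of parser affects this fragment, the same markup can be run through each parser that happens to be installed. This is only a sketch: the helper name `text_by_parser` is made up for illustration, and what lxml or html5lib return may vary with the installed version, so no output is claimed for them here.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = ('</p></td></tr><tr><td colspan="3">   Data I want  </td></tr>'
        '<tr>  <td colspan="3">   Data I want  </td> </tr> '
        '<tr><td colspan="3">   Data I want  </td> </tr></table>')


def text_by_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: extracted text}; None for parsers not installed."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            results[name] = None
            continue
        results[name] = soup.get_text()
    return results


results = text_by_parser(html)
for name, text in results.items():
    # repr() makes empty strings and whitespace visible.
    print(name, repr(text))
```

Printing the `repr` of each result makes it easy to tell an empty string apart from whitespace-only output when comparing parsers.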

If the recurring problem is a missing opening tag, you can use a regular expression to work out which tag it should be, as below:

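from bs4 import BeautifulSoup
import re

html = '</p></td></tr><tr><td colspan="3">   Data I want  </td></tr><tr>  <td colspan="3">   Data I want  </td> </tr> <tr><td colspan="3">   Data I want  </td> </tr></table>'

# Collect every closing tag in document order, e.g. '</p>', '</td>', ..., '</table>'.
closing_tags = re.findall(r'</[a-z]*>', html)

# If the first and last closing tags differ, assume the opening tag that
# matches the last one is missing and prepend it ('</table>' -> '<table>').
if closing_tags and closing_tags[0] != closing_tags[-1]:
    html = closing_tags[-1].replace('/', '') + html

soup = BeautifulSoup(html, "lxml")
print(soup.get_text())

Output: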

Data I want      Data I want       Data I want 

What you have there is not valid HTML. Why not change it to the following?

html = '<table><tr><td colspan="3">   Data I want  </td></tr><tr>  <td colspan="3">   Data I want  </td> </tr> <tr><td colspan="3">   Data I want  </td> </tr></table>'

But something is probably missing before the sample you posted. Where does the HTML come from?
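If the fragment cannot be fixed at its source, one way to get to the well-formed version above is to drop the stray closing tags at the start and add the missing `<table>` yourself. This is a rough sketch that assumes the fragment is always the tail of a table; `repair_table_fragment` is a hypothetical helper, not part of Beautiful Soup.

```python
import re

# One or more closing tags at the very start of the string, e.g. '</p></td></tr>'.
LEADING_CLOSERS = re.compile(r'^(?:\s*</[a-zA-Z][a-zA-Z0-9]*\s*>)+')


def repair_table_fragment(fragment):
    """Strip leading stray closing tags and make sure the markup opens with <table>."""
    trimmed = LEADING_CLOSERS.sub('', fragment, count=1)
    if not re.match(r'\s*<table\b', trimmed, flags=re.IGNORECASE):
        # Assumption: the fragment is the remainder of a table.
        trimmed = '<table>' + trimmed
    return trimmed


html = ('</p></td></tr><tr><td colspan="3">   Data I want  </td></tr>'
        '<tr>  <td colspan="3">   Data I want  </td> </tr> '
        '<tr><td colspan="3">   Data I want  </td> </tr></table>')

repaired = repair_table_fragment(html)
print(repaired)
```

The repaired string can then be passed to `BeautifulSoup(repaired, "lxml")` as in the question. Markup that already starts with `<table>` is returned unchanged.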
