BeautifulSoup cannot parse HTML correctly

Question

So I have the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

html = '</p></td></tr><tr><td colspan="3">   Data I want  </td></tr><tr>  <td colspan="3">   Data I want  </td> </tr> <tr><td colspan="3">   Data I want  </td> </tr></table>'
soup = BeautifulSoup(html, "lxml")

print(soup.getText())

But the output is empty, even though the same code works fine with other HTML samples. The HTML looks like this because it was extracted from a table.

html = '<p>Content</p></td></table>'

That one works fine, for example. Any help?

Edit: I know the HTML is not valid, but the second HTML sample is also invalid and yet it works.

Answer 1

It's because lxml is having trouble parsing this invalid HTML.

Use html.parser instead of lxml.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

html = '</p></td></tr><tr><td colspan="3">   Data I want  </td></tr><tr>  <td colspan="3">   Data I want  </td> </tr> <tr><td colspan="3">   Data I want  </td> </tr></table>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.getText())

Output:

 Data I want      Data I want       Data I want
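
As background for why the parser choice matters: BeautifulSoup's 'html.parser' option is built on Python's standard-library html.parser.HTMLParser, which reports tags as a stream of events and does not stop at a closing tag that was never opened. The sketch below uses only the standard library (Python 3) to illustrate that on the question's input. The TextCollector class and its attribute names are hypothetical helpers made up for this illustration; this is not how BeautifulSoup builds its tree, and it says nothing about what lxml does internally.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text and notes unmatched closing tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []          # non-empty text nodes, stripped
        self.stray_end_tags = []  # closing tags with no matching open tag
        self._open_tags = []      # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open_tags:
            # Pop the stack down to and including the matching open tag.
            while self._open_tags.pop() != tag:
                pass
        else:
            # Nothing to close: record it and carry on parsing.
            self.stray_end_tags.append(tag)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


html = '</p></td></tr><tr><td colspan="3">   Data I want  </td></tr><tr>  <td colspan="3">   Data I want  </td> </tr> <tr><td colspan="3">   Data I want  </td> </tr></table>'

collector = TextCollector()
collector.feed(html)
collector.close()

print(collector.chunks)          # ['Data I want', 'Data I want', 'Data I want']
print(collector.stray_end_tags)  # ['p', 'td', 'tr', 'table']
```

The leading </p></td></tr> and the final </table> show up as stray closing tags, and the three cells are still read, which matches the html.parser result shown in this answer.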

Answer 2

If the consistent issue is a missing opening tag, you can use a regular expression to work out which tag it should be, as below:

from bs4 import BeautifulSoup
import re

html = '</p></td></tr><tr><td colspan="3">   Data I want  </td></tr><tr>  <td colspan="3">   Data I want  </td> </tr> <tr><td colspan="3">   Data I want  </td> </tr></table>'
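pat = re.compile('</[a-z]*>')
closing_tags = pat.findall(html)  # every closing tag, in document order
# If the first and last closing tags differ, assume the opening tag that
# matches the last one is missing and prepend it.
if closing_tags[0] != closing_tags[-1]:
    html = closing_tags[-1].replace('/', '') + html

soup = BeautifulSoup(html, "lxml")
print(soup.getText())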

Output:

Data I want      Data I want       Data I want
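
The heuristic in this answer can also be wrapped in a small function, which makes it easier to guard against input that contains no closing tags at all (indexing an empty list would raise IndexError). This is only a sketch: the function name prepend_missing_open_tag and the slightly broader pattern (uppercase letters and digits in tag names, so that tags such as h1 match) are assumptions added here, not part of the original answer.

```python
import re

# Assumption: tag names are letters followed by letters or digits.
CLOSING_TAG = re.compile(r"</([a-zA-Z][a-zA-Z0-9]*)\s*>")


def prepend_missing_open_tag(html):
    """Prepend an opening tag for the last closing tag if it looks unmatched.

    Same heuristic as the answer: when the first and last closing tags
    differ, assume the outermost opening tag was cut off.
    """
    names = CLOSING_TAG.findall(html)
    if not names:
        return html  # nothing to infer from
    first, last = names[0], names[-1]
    if first != last:
        return "<{}>".format(last) + html
    return html


html = '</p></td></tr><tr><td colspan="3">   Data I want  </td></tr><tr>  <td colspan="3">   Data I want  </td> </tr> <tr><td colspan="3">   Data I want  </td> </tr></table>'

fixed = prepend_missing_open_tag(html)
print(fixed[:20])  # <table></p></td></tr
```

The returned string can then be passed to BeautifulSoup as in the code in this answer. Note that this remains a guess about the missing markup, not a general repair for broken HTML.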

Answer 3

What you have there is not valid HTML. Why don't you change it to the following?

html = '<table><tr><td colspan="3">   Data I want  </td></tr><tr>  <td colspan="3">   Data I want  </td> </tr> <tr><td colspan="3">   Data I want  </td> </tr></table>'

But there is probably something missing before the sample you posted. Where does the HTML code come from?

Answer 1
3 2016-02-18 17:41:36

Answer 2
2 2016-02-18 16:23:29

Answer 3
0 2016-02-18 14:47:09
