Parsing an HTML table with BeautifulSoup

Question

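I am trying to extract the table from the HTML below, without success. I previously used lxml, but it stopped working after the format of the text returned by the web link changed. I know little about parsing, so any help or pointers would be appreciated.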

>>> text2='<html xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:user="http://mycompany.com/mynamespace" xmlns:filter="http://mycompany.com/myfilternamespace">\r\n  <head>\r\n    <META http-equiv="Content-Type" content="text/html; charset=utf-16">\r\n    <title>\r\n    </title>\r\n  </head>\r\n  <body>\r\n    <table border="1">\r\n      <tr>\r\n        <td>\r\n\t\t\t\t\tCOBDate\r\n\t\t\t\t\t</td>\r\n        <td>TOTAL</td>\r\n      </tr>\r\n      <tr>\r\n  <td>2013-6-12</td>\r\n        <td>-10000000</td>\r\n      </tr>\r\n    </table>\r\n  </body>\r\n</html>'
>>> soup=BeautifulSoup(text2)
>>> soup.findAll('table')
[]
>>> BeautifulSoup(text2, 'html.parser').find_all('table')
[<table border="1">
<tr>
<td>

                    COBDate

                    </td>
<td>TOTAL</td>
</tr>
<tr>
<td>2013-6-12</td>
<td>-10000000</td>
</tr>
</table>]

Although BeautifulSoup(text2, 'html.parser').find_all('table') returns the table, the same call finds nothing for the text below:

>>> text1='<html xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:user="http://mycompany.com/mynamespace" xmlns:filter="http://mycompany.com/myfilternamespace">\r\n  <head>\r\n    <META http-equiv="Content-Type" content="text/html; charset=utf-16">\r\n    <title>\r\n    </title>\r\n  </head>\r\n  <body>\r\n    <table border="1">\r\n      <tr>\r\n        <td>\r\n\t\t\t\t\tCOBDate\r\n\t\t\t\t\t</td>\r\n        <td>TOTAL</td>\r\n      </tr>\r\n      <tr>\r\n        <td>2013-6-13</td>\r\n        <td>-1000000</td>\r\n      </tr>\r\n    </table>\r\n  </body>\r\n</html>'
>>> BeautifulSoup(text1, 'html.parser').find_all('table')
[]
>>> BeautifulSoup(text1).find_all('table')
[]

I have updated beautifulsoup, lxml, and libxml2. I am not sure what the problem is.
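Since the behaviour seems to depend on which versions are installed, it may help to print them before comparing results. The following is a minimal sketch, not part of the original post; the helper name `parser_versions` is made up for illustration, and a package that is not installed is reported as `None`.

```python
import sys


def parser_versions():
    """Collect the versions of Python, bs4, lxml and libxml2 (hypothetical helper)."""
    versions = {'python': sys.version.split()[0]}
    try:
        import bs4
        versions['bs4'] = bs4.__version__
    except ImportError:
        versions['bs4'] = None
    try:
        from lxml import etree
        # Both attributes are tuples of integers, e.g. (4, 9, 1, 0)
        versions['lxml'] = '.'.join(map(str, etree.LXML_VERSION))
        versions['libxml2'] = '.'.join(map(str, etree.LIBXML_VERSION))
    except ImportError:
        versions['lxml'] = None
        versions['libxml2'] = None
    return versions


print(parser_versions())
```

Including this output in a question makes it easier for others to reproduce a parser-specific failure.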

Answer 1

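This is less an answer than a workaround. From the comments above, I gather that this may be a package version issue. I am posting the solutions that worked for me in case anyone faces a similar problem now or in the future. The first one that worked for me: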

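from lxml import html
from bs4 import UnicodeDammit

# Let UnicodeDammit detect the encoding of the raw markup
doc = UnicodeDammit(text1, is_html=False)
# Build an lxml HTML parser that uses the detected encoding
parser = html.HTMLParser(encoding=doc.original_encoding)
root = html.document_fromstring(text1, parser=parser)
# First <table> element anywhere in the document
table = root.find('.//table')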
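Once lxml has returned the table element, the cell values still need to be read out of it. The sketch below is one possible way to do that and is not from the original answer: `sample` is a shortened stand-in for text1 (same rows and cells), and `lxml_table_rows` is a hypothetical helper name. It uses the default lxml HTML parser rather than the encoding-aware one above.

```python
from lxml import html

# Shortened stand-in for text1 from the question (assumption: same table content)
sample = (
    '<html><body><table border="1">'
    '<tr><td>\r\n\t\tCOBDate\r\n\t\t</td><td>TOTAL</td></tr>'
    '<tr><td>2013-6-13</td><td>-1000000</td></tr>'
    '</table></body></html>'
)


def lxml_table_rows(markup):
    """Return the first table as a list of rows, each a list of cell strings."""
    root = html.document_fromstring(markup)
    table = root.find('.//table')
    if table is None:
        return []
    rows = []
    for tr in table.findall('.//tr'):
        # text_content() gathers all text inside the cell; strip() drops the padding
        rows.append([td.text_content().strip() for td in tr.findall('td')])
    return rows


print(lxml_table_rows(sample))
```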

The other uses only BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(text1, 'xml')
# or, alternatively:
soup = BeautifulSoup(text1, ['lxml', 'xml'])
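Whichever parser ends up finding the table, the rows can then be collected the same way. The following is a rough sketch under a few assumptions: `table_rows` is a hypothetical helper, `sample` is a shortened stand-in for text1, and the parser name is a parameter so that the one that works in a given environment ('html.parser', 'lxml', 'xml') can be passed in. The default here is 'html.parser' only because it needs no extra package; in the asker's environment that parser returned nothing for text1.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for text1 from the question (assumption: same table content)
sample = (
    '<html><body><table border="1">'
    '<tr><td>\r\n\t\tCOBDate\r\n\t\t</td><td>TOTAL</td></tr>'
    '<tr><td>2013-6-13</td><td>-1000000</td></tr>'
    '</table></body></html>'
)


def table_rows(markup, parser='html.parser'):
    """Return the first table as a list of rows, each a list of cell strings."""
    soup = BeautifulSoup(markup, parser)
    table = soup.find('table')
    if table is None:
        # Same symptom as in the question: the parser did not find a table
        return []
    return [
        [td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.find_all('tr')
    ]


print(table_rows(sample))
```

An empty list from one parser and a populated list from another, for the same markup, would point to a parser or version difference rather than a problem in the HTML itself.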

Answer 1
0 accepted 2013-12-05 07:51:24