
Please help parse this HTML table using BeautifulSoup and lxml the Pythonic way

I have searched a lot about BeautifulSoup, and some suggested lxml as the future of BeautifulSoup. While that makes sense, I am having a tough time parsing the following table out of the whole list of tables on the webpage.

I am interested in the three columns, with a varying number of rows depending on the page and the time it was checked. Both a BeautifulSoup and an lxml solution would be appreciated; that way I can ask the admin to install lxml on the dev machine.

Desired output:

Website                    Last Visited          Last Loaded
http://google.com          01/14/2011 
http://stackoverflow.com   01/10/2011
...... more if present

Following is a code sample from a messy web page:

<table border="2" width="100%">
  <tbody><tr>
    <td width="33%" class="BoldTD">Website</td>
    <td width="33%" class="BoldTD">Last Visited</td>
    <td width="34%" class="BoldTD">Last Loaded</td>
  </tr>
  <tr>
    <td width="33%">
      <a href="http://google.com"</a>
    </td>
    <td width="33%">01/14/2011
            </td>
    <td width="34%">
            </td>
  </tr>
  <tr>
    <td width="33%">
      <a href="http://stackoverflow.com"</a>
    </td>
    <td width="33%">01/10/2011
            </td>
    <td width="34%">
            </td>
  </tr>
</tbody></table>
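Since the question asks for BeautifulSoup specifically, here is a minimal bs4 sketch (not from the original answers). It assumes bs4 is installed, and the sample table is inlined with the malformed anchors repaired to `<a href="..."></a>` so the stdlib `html.parser` backend behaves predictably:

```python
# Hypothetical BeautifulSoup sketch; bs4 assumed installed.
from bs4 import BeautifulSoup

# Sample table inlined, with the broken anchors from the page repaired.
HTML = """
<table border="2" width="100%">
  <tr>
    <td class="BoldTD">Website</td>
    <td class="BoldTD">Last Visited</td>
    <td class="BoldTD">Last Loaded</td>
  </tr>
  <tr>
    <td><a href="http://google.com"></a></td>
    <td>01/14/2011</td>
    <td></td>
  </tr>
  <tr>
    <td><a href="http://stackoverflow.com"></a></td>
    <td>01/10/2011</td>
    <td></td>
  </tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
rows = []
for tr in soup.find("table").find_all("tr")[1:]:  # skip the header row
    tds = tr.find_all("td")
    link = tds[0].find("a")
    rows.append((link["href"],                    # URL from the anchor
                 tds[1].get_text(strip=True),     # Last Visited
                 tds[2].get_text(strip=True)))    # Last Loaded (often empty)
print(rows)
```

If several tables appear on the page, the header cells with `class="BoldTD"` can be used to pick the right one before iterating its rows.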
>>> from lxml import html
>>> table_html = """"
...         <table border="2" width="100%">
...                       <tbody><tr>
...                         <td width="33%" class="BoldTD">Website</td>
...                         <td width="33%" class="BoldTD">Last Visited</td>
...                         <td width="34%" class="BoldTD">Last Loaded</td>
...                       </tr>
...                       <tr>
...                         <td width="33%">
...                           <a href="http://google.com"</a>
...                         </td>
...                         <td width="33%">01/14/2011
...                                 </td>
...                         <td width="34%">
...                                 </td>
...                       </tr>
...                       <tr>
...                         <td width="33%">
...                           <a href="http://stackoverflow.com"</a>
...                         </td>
...                         <td width="33%">01/10/2011
...                                 </td>
...                         <td width="34%">
...                                 </td>
...                       </tr>
...                     </tbody></table>"""
>>> table = html.fromstring(table_html)
>>> for row in table.xpath('//table[@border="2" and @width="100%"]/tbody/tr'):
...     for column in row.xpath('./td[position()=1]/a/@href | ./td[position()>1]/text() | self::node()[position()=1]/td/text()'):
...             print column.strip(),
...     print
... 
Website Last Visited Last Loaded
 http://google.com 01/14/2011 
 http://stackoverflow.com 01/10/2011 
>>> 

Voilà ;) Of course, instead of printing you can add your values to nested lists or dicts ;)

Here's a version that uses ElementTree and the limited XPath it provides:

from xml.etree.ElementTree import ElementTree

doc = ElementTree().parse('table.html')

for t in doc.findall('.//table'):
  # there may be multiple tables, check we have the right one
  if t.find('./tbody/tr/td').text == 'Website':
    for tr in t.findall('./tbody/tr')[1:]: # skip the header row
      tds = tr.findall('./td')
      print tds[0][0].attrib['href'], tds[1].text.strip(), tds[2].text.strip()

Results:

http://google.com 01/14/2011
http://stackoverflow.com 01/10/2011 
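As the previous answer suggested, the values can be collected instead of printed. A sketch of the same ElementTree approach that builds a list of dicts keyed by the header row; the sample table is inlined here, with the broken anchors repaired so it is well-formed XML (plain ElementTree cannot recover from them):

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the sample table (anchors repaired).
TABLE = """\
<table border="2" width="100%">
  <tbody>
    <tr>
      <td class="BoldTD">Website</td>
      <td class="BoldTD">Last Visited</td>
      <td class="BoldTD">Last Loaded</td>
    </tr>
    <tr>
      <td><a href="http://google.com"></a></td>
      <td>01/14/2011</td>
      <td></td>
    </tr>
    <tr>
      <td><a href="http://stackoverflow.com"></a></td>
      <td>01/10/2011</td>
      <td></td>
    </tr>
  </tbody>
</table>
"""

table = ET.fromstring(TABLE)
trs = table.findall('./tbody/tr')
headers = [td.text for td in trs[0]]   # ['Website', 'Last Visited', 'Last Loaded']
records = []
for tr in trs[1:]:                     # skip the header row
    tds = list(tr)
    records.append({
        headers[0]: tds[0].find('a').get('href'),
        headers[1]: (tds[1].text or '').strip(),
        headers[2]: (tds[2].text or '').strip(),   # empty td has text None
    })
print(records)
```

Each record then carries its column names, so downstream code does not depend on column order.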

Here's a version that uses HTMLParser. I tried it against the contents of pastebin.com/tu7dfeRJ . It copes with the meta tag and the doctype declaration, both of which foiled the ElementTree version.

from HTMLParser import HTMLParser

class MyParser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.line = ""
    self.in_tr = False
    self.in_table = False

  def handle_starttag(self, tag, attrs):
    if self.in_table and tag == "tr":
      self.line = ""
      self.in_tr = True
    if tag == 'a':
      for attr in attrs:
        if attr[0] == 'href':
          self.line += attr[1] + " "

  def handle_endtag(self, tag):
    if tag == 'tr':
      self.in_tr = False
      if len(self.line):
        print self.line
    elif tag == "table":
      self.in_table = False

  def handle_data(self, data):
    if data == "Website":
      self.in_table = True
    elif self.in_tr:
      data = data.strip()
      if data:
        self.line += data + " "

if __name__ == '__main__':
  myp = MyParser()
  myp.feed(open('table.html').read())

Hopefully this addresses everything you need and you can accept this as the answer. Updated as requested.
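On Python 3, the module above moved to `html.parser` and `print` became a function. A port of the same parser (hypothetical class name `TableParser`) that collects rows into a list instead of printing; the sample table is inlined with the anchors repaired:

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Python 3 port of the HTMLParser answer; rows end up in self.rows."""
    def __init__(self):
        super().__init__()
        self.row = []
        self.rows = []
        self.in_table = False
        self.in_tr = False

    def handle_starttag(self, tag, attrs):
        if self.in_table and tag == 'tr':
            self.row = []
            self.in_tr = True
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.row.append(value)

    def handle_endtag(self, tag):
        if tag == 'tr':
            self.in_tr = False
            if self.row:                 # header row stays empty, so it is skipped
                self.rows.append(self.row)
        elif tag == 'table':
            self.in_table = False

    def handle_data(self, data):
        if data == 'Website':
            self.in_table = True         # header cell marks the right table
        elif self.in_tr:
            data = data.strip()
            if data:
                self.row.append(data)

SAMPLE = """
<table border="2" width="100%">
  <tr>
    <td class="BoldTD">Website</td>
    <td class="BoldTD">Last Visited</td>
    <td class="BoldTD">Last Loaded</td>
  </tr>
  <tr>
    <td><a href="http://google.com"></a></td>
    <td>01/14/2011</td>
    <td></td>
  </tr>
  <tr>
    <td><a href="http://stackoverflow.com"></a></td>
    <td>01/10/2011</td>
    <td></td>
  </tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE)
print(parser.rows)
```

Because `handle_data` only flags `in_table` when it sees the "Website" header cell, the header row itself is never collected, matching the Python 2 version's behaviour.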
