Please help parse this html table using BeautifulSoup and lxml the pythonic way

Question

I have searched a lot about BeautifulSoup, and some people suggested lxml as the future of BeautifulSoup. While that makes sense, I am having a tough time parsing the following table, which is one of a whole list of tables on the web page.

I am interested in the three columns; the number of rows varies depending on the page and the time it was checked. A BeautifulSoup and lxml solution would be much appreciated. That way I can ask the admin to install lxml on the dev machine.

Desired output:

Website                    Last Visited          Last Loaded
http://google.com          01/14/2011 
http://stackoverflow.com   01/10/2011
...... more if present

The following is a code sample from a messy web page:

<table border="2" width="100%">
  <tbody><tr>
    <td width="33%" class="BoldTD">Website</td>
    <td width="33%" class="BoldTD">Last Visited</td>
    <td width="34%" class="BoldTD">Last Loaded</td>
  </tr>
  <tr>
    <td width="33%">
      <a href="http://google.com"</a>
    </td>
    <td width="33%">01/14/2011
            </td>
    <td width="34%">
            </td>
  </tr>
  <tr>
    <td width="33%">
      <a href="http://stackoverflow.com"</a>
    </td>
    <td width="33%">01/10/2011
            </td>
    <td width="34%">
            </td>
  </tr>
</tbody></table>

Answer 1

>>> from lxml import html
>>> table_html = """
...         <table border="2" width="100%">
...                       <tbody><tr>
...                         <td width="33%" class="BoldTD">Website</td>
...                         <td width="33%" class="BoldTD">Last Visited</td>
...                         <td width="34%" class="BoldTD">Last Loaded</td>
...                       </tr>
...                       <tr>
...                         <td width="33%">
...                           <a href="http://google.com"</a>
...                         </td>
...                         <td width="33%">01/14/2011
...                                 </td>
...                         <td width="34%">
...                                 </td>
...                       </tr>
...                       <tr>
...                         <td width="33%">
...                           <a href="http://stackoverflow.com"</a>
...                         </td>
...                         <td width="33%">01/10/2011
...                                 </td>
...                         <td width="34%">
...                                 </td>
...                       </tr>
...                     </tbody></table>"""
>>> table = html.fromstring(table_html)
>>> for row in table.xpath('//table[@border="2" and @width="100%"]/tbody/tr'):
...     for column in row.xpath('./td[position()=1]/a/@href | ./td[position()>1]/text() | self::node()[position()=1]/td/text()'):
...         print(column.strip(), end=' ')
...     print()
... 
Website Last Visited Last Loaded
 http://google.com 01/14/2011 
 http://stackoverflow.com 01/10/2011 
>>>

Voilà ;) Of course, instead of printing, you can add your values to nested lists or dicts.
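As a rough sketch of that last remark: assuming the stripped, non-empty values from the loop were appended to a nested list (the `rows` list below is typed by hand to stand in for the XPath results, and `records` is just an illustrative name), the header row can be used as dict keys like this:

```python
# Stand-in for what the XPath loop yields after stripping and dropping
# empty strings (typed by hand here; not produced by lxml in this sketch).
rows = [
    ["Website", "Last Visited", "Last Loaded"],
    ["http://google.com", "01/14/2011"],
    ["http://stackoverflow.com", "01/10/2011"],
]

header, *data = rows

# Pad short rows so every record has all three keys, then pair each
# value with its column name.
records = [
    dict(zip(header, row + [""] * (len(header) - len(row))))
    for row in data
]

for record in records:
    print(record)
```

The padding step is there because the "Last Loaded" cells are empty on this page, so a data row may carry fewer values than the header.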

Answer 2

Here's a version that uses ElementTree and the limited XPath it provides:

from xml.etree.ElementTree import ElementTree

# ElementTree is an XML parser, so table.html has to be well-formed
doc = ElementTree().parse('table.html')

for t in doc.findall('.//table'):
  # there may be multiple tables, check we have the right one
  first_cell = t.find('./tbody/tr/td')
  if first_cell is not None and first_cell.text == 'Website':
    # no trailing slash: in current ElementTree './tbody/tr/' means 'tr/*'
    for tr in t.findall('./tbody/tr')[1:]: # skip the header row
      tds = tr.findall('./td')
      print(tds[0][0].attrib['href'], tds[1].text.strip(), tds[2].text.strip())

Results:

http://google.com 01/14/2011
http://stackoverflow.com 01/10/2011
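For what it's worth, here is a small self-contained sketch of the same ElementTree idea that returns dicts instead of printing. It is only an illustration under some assumptions: `SAMPLE` is a hand-cleaned, well-formed stand-in for the real page (ElementTree will not accept the messy original), and `extract_rows` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page's table (assumption:
# the real markup would need cleaning first, since ElementTree parses XML).
SAMPLE = """
<html><body>
<table border="2" width="100%">
  <tbody><tr>
    <td class="BoldTD">Website</td>
    <td class="BoldTD">Last Visited</td>
    <td class="BoldTD">Last Loaded</td>
  </tr>
  <tr>
    <td><a href="http://google.com"></a></td>
    <td>01/14/2011
    </td>
    <td>
    </td>
  </tr>
  <tr>
    <td><a href="http://stackoverflow.com"></a></td>
    <td>01/10/2011
    </td>
    <td>
    </td>
  </tr>
</tbody></table>
</body></html>
"""


def extract_rows(root):
    """Return one dict per data row of the table whose first cell is 'Website'."""
    rows = []
    for table in root.iter('table'):
        trs = table.findall('./tbody/tr')
        if not trs:
            continue
        header = [(td.text or '').strip() for td in trs[0].findall('./td')]
        if not header or header[0] != 'Website':
            continue  # not the table we are after
        for tr in trs[1:]:
            tds = tr.findall('./td')
            link = tds[0].find('./a')
            # first column: the link target; other columns: the cell text
            values = [link.get('href', '') if link is not None else '']
            values += [(td.text or '').strip() for td in tds[1:]]
            rows.append(dict(zip(header, values)))
    return rows


rows = extract_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

Using `(td.text or '')` guards against cells with no text at all, where `.text` is `None` and a bare `.strip()` would raise.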

Answer 3

Here's a version that uses HTMLParser. I tried it against the contents of pastebin.com/tu7dfeRJ. It copes with the meta tag and doctype declaration, both of which foiled the ElementTree version.

from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class MyParser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.line = ""         # text collected for the current row
    self.in_tr = False     # True while inside a row of the target table
    self.in_table = False  # True once the "Website" header cell is seen

  def handle_starttag(self, tag, attrs):
    if self.in_table and tag == "tr":
      self.line = ""
      self.in_tr = True
    # only collect links inside a row of the target table
    if self.in_tr and tag == 'a':
      for attr in attrs:
        if attr[0] == 'href' and attr[1]:
          self.line += attr[1] + " "

  def handle_endtag(self, tag):
    if tag == 'tr':
      # print only rows of the target table, never rows of other tables
      if self.in_tr and self.line:
        print(self.line)
      self.in_tr = False
    elif tag == "table":
      self.in_table = False

  def handle_data(self, data):
    if data == "Website":
      self.in_table = True
    elif self.in_tr:
      data = data.strip()
      if data:
        self.line += data + " "

if __name__ == '__main__':
  myp = MyParser()
  with open('table.html') as f:
    myp.feed(f.read())

Hopefully this addresses everything you need and you can accept this as the answer. Updated as requested.
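If you would rather keep the values than print them, a variant along these lines might work. It is a sketch, not a drop-in replacement: `TableCollector`, its attributes, and the `PAGE` string are names and data I made up for illustration, and the sample markup is a tidied version of the table from the question with a doctype, a meta tag, and an unrelated table added.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the data rows of the table whose header contains 'Website'."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._cells = None      # cells of the row being read, or None
        self._in_table = False  # True once the 'Website' header cell is seen

    def handle_starttag(self, tag, attrs):
        if self._in_table and tag == 'tr':
            self._cells = []
        elif self._cells is not None and tag == 'td':
            self._cells.append('')      # start a new, empty cell
        elif self._cells and tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self._cells[-1] += href

    def handle_endtag(self, tag):
        if tag == 'tr' and self._cells is not None:
            self.rows.append(self._cells)
            self._cells = None
        elif tag == 'table':
            self._in_table = False

    def handle_data(self, data):
        text = data.strip()
        if text == 'Website':
            self._in_table = True       # header cell of the target table
        elif self._cells and text:
            self._cells[-1] += text


# Hypothetical test page, not the real one from the question.
PAGE = """<!DOCTYPE html>
<html><head><meta charset="utf-8"></head><body>
<table><tr><td>Some other table</td></tr></table>
<table border="2" width="100%">
  <tbody><tr>
    <td class="BoldTD">Website</td>
    <td class="BoldTD">Last Visited</td>
    <td class="BoldTD">Last Loaded</td>
  </tr>
  <tr>
    <td><a href="http://google.com"></a></td>
    <td>01/14/2011
    </td>
    <td>
    </td>
  </tr>
  <tr>
    <td><a href="http://stackoverflow.com"></a></td>
    <td>01/10/2011
    </td>
    <td>
    </td>
  </tr>
</tbody></table>
</body></html>
"""

parser = TableCollector()
parser.feed(PAGE)
parser.close()

# Print the rows in aligned columns, roughly like the desired output.
for website, visited, loaded in parser.rows:
    print(website.ljust(26), visited.ljust(14), loaded)
```

Each row keeps an empty string for the empty "Last Loaded" cell, so every row has three entries and unpacking stays safe.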

3 answers

solution1
4 2011-01-21 18:01:18

solution2
3 2011-01-21 20:24:17

solution3
2 ACCEPTED 2011-01-25 21:47:13