BeautifulSoup or regex HTML table to data structure?

Question

I have an HTML table that I'm trying to extract information from. However, some of the cells span multiple rows/columns, so I would like to use something like BeautifulSoup to parse the table into some kind of Python structure. I'm thinking of just using a list of lists, so I would turn something like

<tr>
  <td>1,1</td>
  <td>1,2</td>
</tr>
<tr>
  <td>2,1</td>
  <td>2,2</td>
</tr>

into

[['1,1', '1,2'],
 ['2,1', '2,2']]

That, I think, should be fairly straightforward. However, there are some complications because some of the cells span multiple rows/columns. There is also a lot of completely unnecessary information:

    <td ondblclick="DoAdd('/student_center/sc_all_rooms/d05/09/2010/editformnew?display=W&amp;style=L&amp;positioning=A&amp;adddirect=yes&amp;accessid=CreateNewEdit&amp;filterblock=N&amp;popeditform=yes&amp;returncalendar=student_center/sc_all_rooms')"
     class="listdefaultmonthbg" 
     style="cursor:crosshair;" 
     width="5%" 
     nowrap="1" 
     rowspan="1">
       <a class="listdatelink" 
          href="/student_center/sc_all_rooms/d05/09/2010/edit?style=L&amp;display=W&amp;positioning=A&amp;filterblock=N&amp;adddirect=yes&amp;accessid=CreateNewEdit">Sep 5</a>
    </td>

And what the code really looks like is even worse. All I really need out of there is:

<td rowspan="1">Sep 5</td>

Two rows later, there is a cell with a rowspan of 17. For multi-row spans I was thinking something like this:

<tr>
  <td rowspan="2">Sep 5</td>
  <td>Some event</td>
</tr>
<tr>
  <td>Some other event</td>
</tr>

would end up like this:

[["Sep 5", "Some event"],
 [None, "Some other event"]]
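
The span handling described above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: the function name `expand_spans` and the `(text, rowspan, colspan)` tuple layout are made up for this example, and it assumes a well-formed table whose spans do not overlap. With BeautifulSoup, such tuples could presumably be built from each cell with something like `int(td.get("rowspan", 1))`.

```python
def expand_spans(rows):
    """Build a grid from rows of (text, rowspan, colspan) cells.

    Hypothetical helper: only the top-left position of a spanning cell
    keeps its text; every other position it covers is filled with None.
    """
    grid = []
    pending = {}  # column index -> number of further rows still covered from above
    for cells in rows:
        out = []
        col = 0
        queue = list(cells)
        # Keep going while cells remain or a later column is still covered from above.
        while queue or any(c >= col for c in pending):
            if pending.get(col, 0) > 0:
                # Column occupied by a rowspan that started in an earlier row.
                out.append(None)
                pending[col] -= 1
                if pending[col] == 0:
                    del pending[col]
                col += 1
                continue
            if not queue:
                # No cell left for this column, but a later column is still covered.
                out.append(None)
                col += 1
                continue
            text, rowspan, colspan = queue.pop(0)
            for offset in range(colspan):
                out.append(text if offset == 0 else None)
                if rowspan > 1:
                    # Remember that the following rows are covered in this column.
                    pending[col] = rowspan - 1
                col += 1
        grid.append(out)
    return grid


rows = [
    [("Sep 5", 2, 1), ("Some event", 1, 1)],
    [("Some other event", 1, 1)],
]
print(expand_spans(rows))
# [['Sep 5', 'Some event'], [None, 'Some other event']]
```

Keeping the span bookkeeping separate from the HTML parsing means the same function works whichever parser produces the cells.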

There are multiple tables on the page, and I can already find the one I want; I'm just not sure how to parse out the information I need. I know I can use BeautifulSoup to "RenderContents", but in some cases there are link tags that I need to get rid of (while keeping the text).

I was thinking of a process something like this:

Find table
Count rows in tables ( len(table.findAll('tr')) ?)
Create list
Parse table into list (BeautifulSoup syntax???)
???
Profit! (Well, it's a purely internal program, so not really... )
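
The first four steps might look roughly like this with BeautifulSoup 4 (the `bs4` package). It is a minimal sketch for a table without spanning cells; the `id="events"` attribute and the sample HTML are invented for the example, and the real page will need its own way of selecting the table.

```python
from bs4 import BeautifulSoup

# Made-up sample page with two tables; only the second one is wanted.
html = """
<table id="other"><tr><td>ignore me</td></tr></table>
<table id="events">
<tr><td>1,1</td><td>1,2</td></tr>
<tr><td>2,1</td><td>2,2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. Find the table (the id "events" is an assumption for this example).
table = soup.find("table", id="events")

# 2. Collect its rows; len(rows) is the row count.
rows = table.find_all("tr")

# 3-4. Create the list and fill it, one inner list per row.
result = []
for tr in rows:
    result.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(len(rows))  # 2
print(result)     # [['1,1', '1,2'], ['2,1', '2,2']]
```

Note that `find_all("tr")` searches recursively, so a table nested inside a cell would contribute its rows too.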

Answer 1

There was a recent discussion in the Python group on LinkedIn about a similar issue, and apparently lxml is the most recommended Pythonic parser for HTML pages.

http://www.linkedin.com/groupItem?view=&gid=25827&type=member&item=27735259&qid=d2948a0e-6c0c-4256-851b-5e7007859553&goback=.gmp_25827
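
For comparison, here is a rough sketch of what reading a simple table with lxml could look like. It assumes the third-party `lxml` package is installed and uses made-up sample HTML; it does not handle rowspan or colspan.

```python
import lxml.html

# Made-up sample table; one cell wraps its text in a link.
html = """
<table>
<tr><td><a href="#">1,1</a></td><td>1,2</td></tr>
<tr><td>2,1</td><td>2,2</td></tr>
</table>
"""

root = lxml.html.fromstring(html)

# text_content() returns the text of a cell with any nested tags removed.
table_rows = [
    [td.text_content().strip() for td in tr.xpath("./td")]
    for tr in root.xpath(".//tr")
]

print(table_rows)  # [['1,1', '1,2'], ['2,1', '2,2']]
```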

Answer 2

You'll probably need to identify the table by some attribute, such as id or name.

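from bs4 import BeautifulSoup

data = """
<table>
<tr>
  <td>1,1</td>
  <td>1,2</td>
</tr>
<tr>
  <td>2,1</td>
  <td>2,2</td>
</tr>
</table>
"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, "html.parser")

for t in soup.find_all('table'):
    for tr in t.find_all('tr'):
        # td.contents is the list of child nodes of each cell
        print([td.contents for td in tr.find_all('td')])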

Edit: What should the program do if there are multiple links?

For example:

<td><a href="#">A</a> B <a href="#">C</a></td>
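
Two possible behaviours for such a cell, sketched with BeautifulSoup 4 (`bs4`); which one is wanted depends on the data, so treat this as an illustration rather than the intended answer.

```python
from bs4 import BeautifulSoup

cell_html = '<td><a href="#">A</a> B <a href="#">C</a></td>'
td = BeautifulSoup(cell_html, "html.parser").td

# Option 1: drop the tags and keep all the text as one string.
joined = td.get_text()

# Option 2: keep each piece of text separately, with whitespace stripped.
pieces = list(td.stripped_strings)

print(joined)  # A B C
print(pieces)  # ['A', 'B', 'C']
```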

2 answers

solution1
2 ACCEPTED 2010-09-16 14:41:34

solution2
0 2010-09-16 15:41:10
