I've got an HTML table that I'm trying to parse the information from. However, some of the tables span multiple rows/columns, so what I would like to do is use something like BeautifulSoup to parse the table into some type of Python structure. I'm thinking of just using a list of lists so I would turn something like
<tr>
<td>1,1</td>
<td>1,2</td>
</tr>
<tr>
<td>2,1</td>
<td>2,2</td>
</tr>
into
[['1,1', '1,2'],
['2,1', '2,2']]
Which I (think) should be fairly straightforward. However, there are some slight complications because some of the cells span multiple rows/cols. Plus there's a lot of completely unnecessary information:
<td ondblclick="DoAdd('/student_center/sc_all_rooms/d05/09/2010/editformnew?display=W&style=L&positioning=A&adddirect=yes&accessid=CreateNewEdit&filterblock=N&popeditform=yes&returncalendar=student_center/sc_all_rooms')"
class="listdefaultmonthbg"
style="cursor:crosshair;"
width="5%"
nowrap="1"
rowspan="1">
<a class="listdatelink"
href="/student_center/sc_all_rooms/d05/09/2010/edit?style=L&display=W&positioning=A&filterblock=N&adddirect=yes&accessid=CreateNewEdit">Sep 5</a>
</td>
And what the code really looks like is even worse. All I really need out of there is:
<td rowspan="1">Sep 5</td>
Two rows later, there is a with a rowspan of 17. For multi-row spans I was thinking something like this:
<tr>
<td rowspan="2">Sep 5</td>
<td>Some event</td>
</tr>
<tr>
<td>Some other event</td>
</tr>
would end out like this:
[["Sep 5", "Some event"],
[None, "Some other event"]]
There are multiple tables on the page, and I can find the one I want already, I'm just not sure how to parse out the information I need. I know I can use BeautfulSoup to "RenderContents", but in some cases there are link tags that I need to get rid of (while keeping the text).
I was thinking of a process something like this:
len(table.findAll('tr'))
?) There was a recent discussion on the python group on linkedin about a similar issue, and apparently lxml is the most recommended pythonic parser for html pages.
You'll probably need to identify the table with some attrs, id or name.
from BeautifulSoup import BeautifulSoup
data = """
<table>
<tr>
<td>1,1</td>
<td>1,2</td>
</tr>
<tr>
<td>2,1</td>
<td>2,2</td>
</tr>
</table>
"""
soup = BeautifulSoup(data)
for t in soup.findAll('table'):
for tr in t.findAll('tr'):
print [td.contents for td in tr.findAll('td')]
Edit: What should do the program if there're multiple links?
Ex:
<td><a href="#">A</a> B <a href="#">C</a></td>
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.