Convert HTML table to CSV using Python

I have a string containing the source of an HTML file, fetched with the mechanize library. The HTML file will always contain a table like the one below. I want to convert the table to CSV format.

Several SO questions address the same problem, but their tables have a class name. My table doesn't have a class attribute. What should I do?

<table border=1 cellPadding="2" cellSpacing="0" width="75%"  bordercolor="#000000" >

  <tr bgcolor="mediumblue">
    <td width="20%"><p align="center"><font face="Arial" color="white" size="2"><strong>SUB CODE</strong></font></p></td>
    <td width="26%"><p align="left"><font face="Arial" color="white" size="2"><strong>SUB NAME</strong></font></p></td>
    <td width="13%"><p align="left"><font face="Arial" color="white" size="2"><strong>THEORY</strong></font></p>  </td>
    <td width="10%"><p align="left"><font face="Arial" color="white" size="2"><strong>PRACTICAL</strong></font></p> </td>
    <td width="17%"><p align="left"><font face="Arial" color="white" size="2"><strong>MARKS</strong></font></p></td>
    <td width="14%"><p align="center"><font face="Arial" color="white" size="2"><strong>GRADE</strong></font></p></td>
  </tr>


  <tr bgColor="#ffffff">
    <td align="middle"><font face="Arial" size=2> 301</font></td>
    <td align="left" ><font face="Arial" size=2>ENGLISH CORE</font></td>
    <td align="left" ><font face="Arial" size=2>067</font></td>
    <td align="left" ><font face="Arial" size=2></font></td>
    <td align="left" ><font face="Arial" size=2>067&nbsp;&nbsp;&nbsp;&nbsp;</font></td>
    <td align="middle"><font face="Arial" size=2>C2</font></td>
  </tr>

  </table>

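pandas has a neat way to read HTML tables. pd.read_html returns a list of DataFrames, one per table found in the markup, so no class attribute is needed.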

import pandas as pd

html_data = '''
<table border=1 cellPadding="2" cellSpacing="0" width="75%"  bordercolor="#000000" >

  <tr bgcolor="mediumblue">
    <td width="20%"><p align="center"><font face="Arial" color="white" size="2"><strong>SUB CODE</strong></font></p></td>
    <td width="26%"><p align="left"><font face="Arial" color="white" size="2"><strong>SUB NAME</strong></font></p></td>
    <td width="13%"><p align="left"><font face="Arial" color="white" size="2"><strong>THEORY</strong></font></p>  </td>
    <td width="10%"><p align="left"><font face="Arial" color="white" size="2"><strong>PRACTICAL</strong></font></p> </td>
    <td width="17%"><p align="left"><font face="Arial" color="white" size="2"><strong>MARKS</strong></font></p></td>
    <td width="14%"><p align="center"><font face="Arial" color="white" size="2"><strong>GRADE</strong></font></p></td>
  </tr>


  <tr bgColor="#ffffff">
    <td align="middle"><font face="Arial" size=2> 301</font></td>
    <td align="left" ><font face="Arial" size=2>ENGLISH CORE</font></td>
    <td align="left" ><font face="Arial" size=2>067</font></td>
    <td align="left" ><font face="Arial" size=2></font></td>
    <td align="left" ><font face="Arial" size=2>067&nbsp;&nbsp;&nbsp;&nbsp;</font></td>
    <td align="middle"><font face="Arial" size=2>C2</font></td>
  </tr>

  </table>
'''

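# read_html returns one DataFrame per <table>; take the first one.
# The StringIO wrapper is used because recent pandas versions deprecate
# passing a literal HTML string directly.
from io import StringIO

print(pd.read_html(StringIO(html_data))[0].to_csv(index=False, header=False))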

When there are multiple tables in the HTML, you can check each table's column names to discard the ones you don't need.
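
If pandas is not available, the same idea (picking the table by its header row instead of a class attribute) can be sketched with the standard library alone. This is only a minimal sketch and not the approach from the answer above: `TableCollector` and `table_to_csv` are hypothetical names, the sample markup is a cut-down stand-in for the real page, and the code assumes the header is the first row of the table and that tables are not nested.

```python
import csv
import io
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []    # one list of rows per <table> seen so far
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TD> and <td> look the same here.
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse whitespace, including &nbsp;.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        # Text outside a cell (indentation between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)


def table_to_csv(html, required_headers):
    """Return CSV text for the first table whose first row holds all headers."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    for rows in parser.tables:
        if rows and set(required_headers) <= set(rows[0]):
            out = io.StringIO()
            csv.writer(out, lineterminator="\n").writerows(rows)
            return out.getvalue()
    return None  # no table matched


# Cut-down sample: an unrelated table followed by the one we want.
sample = """
<table><tr><td>Menu</td><td>Links</td></tr></table>
<table border=1>
  <tr><td><strong>SUB CODE</strong></td><td><strong>GRADE</strong></td></tr>
  <tr><td><font size=2> 301</font></td><td>C2&nbsp;&nbsp;</td></tr>
</table>
"""

print(table_to_csv(sample, ["SUB CODE", "GRADE"]))
```

Because every cell is kept as text, values such as 067 keep their leading zero in the CSV output.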
