Pandas read_html only finds the header of an HTML table

Question

I have this table:

I parsed it using pandas:

import pandas
s = '<table id="datatable"><tr><th onclick="sortTable(0)">Gene locus</th><th onclick="sortTable(1)">Organism</th><th onclick="sortTable(2)">Found in</th><th onclick="sortTable(3)">Gene name</th><th onclick="sortTable(4)">AA mutation</th><th onclick="sortTable(5)">Drug</th><th onclick="sortTable(6)">Tandem repeat name</th><th onclick="sortTable(7)">Tandem repeat sequence</th><th onclick="sortTable(8)">Reference</th></tr><td>ASPNIDRAFT_55947</td><td>Aspergillus niger</td><td>Animal - Human</td><td>CYP51a</td><td>R228Q </td><td>Posaconazole</td><td></td><td><div style="word-wrap: break-word;max-width: 250px;"></div></td><td><a href="http://jcm.asm.org/content/54/9/2365.full">10.1128/JCM.01075-16</a></td></tr></table>'
table = pandas.read_html(s)[0]
print(table)

However this gives me:

Empty DataFrame
Columns: [Gene locus, Organism, Found in, Gene name, AA mutation, Drug, Tandem repeat name, Tandem repeat sequence, Reference]
Index: []

There is clearly a filled row (<tr>...) beneath the header (<th>...), so I can't figure out where it goes wrong or, more importantly, how I can read the table properly.

(P.S. I can't access Imgur from the country I'm in now, so feel free to change the link if it is inappropriate, or tell me how I can change it.)

Answer 1

You are missing a <tr> before the first <td>.

Here is the corrected string:

s = '<table id="datatable"><tr><th onclick="sortTable(0)">Gene locus</th><th onclick="sortTable(1)">Organism</th><th onclick="sortTable(2)">Found in</th><th onclick="sortTable(3)">Gene name</th><th onclick="sortTable(4)">AA mutation</th><th onclick="sortTable(5)">Drug</th><th onclick="sortTable(6)">Tandem repeat name</th><th onclick="sortTable(7)">Tandem repeat sequence</th><th onclick="sortTable(8)">Reference</th></tr><tr><td>ASPNIDRAFT_55947</td><td>Aspergillus niger</td><td>Animal - Human</td><td>CYP51a</td><td>R228Q </td><td>Posaconazole</td><td></td><td><div style="word-wrap: break-word;max-width: 250px;"></div></td><td><a href="http://jcm.asm.org/content/54/9/2365.full">10.1128/JCM.01075-16</a></td></tr></table>'

It works now.
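
If the markup comes from a larger page, a missing <tr> like this can be hard to spot by eye. The following is a minimal sketch, using only the standard library's html.parser, that reports cells opened while no row is open. The names RowChecker and find_orphan_cells are made up for illustration, the two short tables are stand-ins for the real one, and the check assumes tables are not nested.

```python
from html.parser import HTMLParser


class RowChecker(HTMLParser):
    """Collect table cells that open while no <tr> is open."""

    def __init__(self):
        super().__init__()
        self.in_row = False  # True between <tr> and </tr>
        self.orphans = []    # (line, column, tag) for cells outside a row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.in_row = True
        elif tag in ("td", "th") and not self.in_row:
            # getpos() gives the parser's current line and column
            self.orphans.append((*self.getpos(), tag))

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_row = False


def find_orphan_cells(markup):
    """Return the cells in markup that are not enclosed in a <tr>."""
    checker = RowChecker()
    checker.feed(markup)
    checker.close()
    return checker.orphans


# Hypothetical, shortened versions of the broken and corrected tables.
broken = "<table><tr><th>A</th></tr><td>1</td></tr></table>"
fixed = "<table><tr><th>A</th></tr><tr><td>1</td></tr></table>"

print(find_orphan_cells(broken))  # one entry, for the stray <td>
print(find_orphan_cells(fixed))   # empty list
```

An empty result does not prove the table is well formed; it only rules out this particular mistake.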

Answer 2

Fixed:

import pandas
s = '<table id="datatable"><tr><th onclick="sortTable(0)">Gene locus</th><th onclick="sortTable(1)">Organism</th><th onclick="sortTable(2)">Found in</th><th onclick="sortTable(3)">Gene name</th><th onclick="sortTable(4)">AA mutation</th><th onclick="sortTable(5)">Drug</th><th onclick="sortTable(6)">Tandem repeat name</th><th onclick="sortTable(7)">Tandem repeat sequence</th><th onclick="sortTable(8)">Reference</th></tr><tr><td>ASPNIDRAFT_55947</td><td>Aspergillus niger</td><td>Animal - Human</td><td>CYP51a</td><td>R228Q </td><td>Posaconazole</td><td></td><td><div style="word-wrap: break-word;max-width: 250px;"></div></td><td><a href="http://jcm.asm.org/content/54/9/2365.full">10.1128/JCM.01075-16</a></td></tr></table>'
table = pandas.read_html(s)[0]
print(table)

You were missing a <tr> tag after the first </tr> tag.

Output:

         Gene locus  ...             Reference
0  ASPNIDRAFT_55947  ...  10.1128/JCM.01075-16

[1 rows x 9 columns]
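
A side note for newer pandas versions: as far as I know, releases from 2.1 onward warn when read_html is given a literal HTML string and recommend wrapping it in StringIO. A minimal sketch of that, assuming pandas and one of its HTML parser backends (lxml, or bs4 with html5lib) are installed; the two-column table is a shortened, hypothetical version of the corrected one:

```python
from io import StringIO

import pandas

# Shortened, hypothetical two-column version of the corrected table.
html = (
    '<table id="datatable">'
    "<tr><th>Gene locus</th><th>Organism</th></tr>"
    "<tr><td>ASPNIDRAFT_55947</td><td>Aspergillus niger</td></tr>"
    "</table>"
)

# StringIO gives read_html a file-like object instead of a raw string.
table = pandas.read_html(StringIO(html))[0]

# A correctly parsed table has one data row under the two header cells.
print(table.shape)
print(list(table.columns))
```

Checking table.shape after parsing is a cheap way to notice when only the header was picked up.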

2 answers

solution1
1 ACCEPTED 2019-08-17 18:50:04

solution2
1 2019-08-17 18:50:38
