Turning an HTML file into CSV using Python

Question

After a few days of reading and searching the internet, I've decided to ask here for help.

I have an HTML file that contains a table, and I need to turn this HTML file into CSV.

Small sample of my HTML file:

<html>
<body>
<p class="timestamp">Fri 21 Jul 13:14:15 BST 2017
</p>

<h3>TAT Signal and TMH near C-terminus</h3>
<table>
<tr style = "background:#E7EBD8"><td>1</td><td>GCF_000688455.1_ASM68845v1_protein.faa.gz</td><td colspan = 4>Acidobacterium ailaaui</td></tr>
<tr style = "background:#E7EBD8"><td>Taxonomy</td><td colspan = 5>Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium</td></tr>
<tr style = "background:#E7EBD8"><td>First 60 AAs</td><td colspan = 5>MSRRTFVSSATAGLAALGALSSAAEGHAQLVWTSKNWKLAEFETLLREPARIRQVYDVTQ</td></tr>
<tr style = "background:#E7EBD8"><td>WP_026442391.1</td><td colspan = 5>hypothetical protein [Acidobacterium ailaaui]</td></tr>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Length: 233</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Number of predicted TMHs:  1</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Exp number of AAs in TMHs: 21.25002</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Exp number, first 60 AAs:  1.35114</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Total prob of N-in:        0.67991</td>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_026442391.1</td>
<td>WP_026442391.1</td>
<td>inside</td>
<td>1</td>
<td>201</td>
</tr>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_026442391.1</td>
<td>WP_026442391.1</td>
<td>TMhelix</td>
<td>202</td>
<td>224</td>
</tr>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_026442391.1</td>
<td>WP_026442391.1</td>
<td>outside</td>
<td>225</td>
<td>233</td>
</tr>
<tr style = "background:#D8EBEA"><td>2</td><td>GCF_000022565.1_ASM2256v1_protein.faa.gz</td><td colspan = 4>Acidobacterium capsulatum ATCC 51196</td></tr>
<tr style = "background:#D8EBEA"><td>Taxonomy</td><td colspan = 5>Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium capsulatum</td></tr>
<tr style = "background:#D8EBEA"><td>First 60 AAs</td><td colspan = 5>MKSISRRSFVTTAAAGMAALGSLGPALPAAQGQAVEMASDWDISSFNQLAQSPARVKQLF</td></tr>
<tr style = "background:#D8EBEA"><td>WP_012680923.1</td><td colspan = 5>Tat pathway signal sequence domain-containing protein [Acidobacterium capsulatum]</td></tr>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Length: 237</td>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Number of predicted TMHs:  1</td>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Exp number of AAs in TMHs: 31.62059</td>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Exp number, first 60 AAs:  5.92535</td>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Total prob of N-in:        0.86701</td>
<tr style = "background:#D8EBEA">
<td>TMHMM</td>
<td>WP_012680923.1</td>
<td>WP_012680923.1</td>
<td>inside</td>
<td>1</td>
<td>205</td>
</tr>
<tr style = "background:#D8EBEA">
<td>TMHMM</td>
<td>WP_012680923.1</td>
<td>WP_012680923.1</td>
<td>TMhelix</td>
<td>206</td>
<td>228</td>
</tr>
<tr style = "background:#D8EBEA">
<td>TMHMM</td>
<td>WP_012680923.1</td>
<td>WP_012680923.1</td>
<td>outside</td>
<td>229</td>
<td>237</td>
</tr>
<tr style = "background:#E7EBD8"><td>3</td><td>GCF_000014005.1_ASM1400v1_protein.faa.gz</td><td colspan = 4>Candidatus Koribacter versatilis Ellin345</td></tr>
<tr style = "background:#E7EBD8"><td>Taxonomy</td><td colspan = 5>Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Candidatus Koribacter; Candidatus Koribacter versatilis</td></tr>
<tr style = "background:#E7EBD8"><td>First 60 AAs</td><td colspan = 5>MGEKALMSKKPTIEEHLKATGVTRRSFVQLCGMLMAAAPIGLSLTSKASAQEVAKVVGKA</td></tr>
<tr style = "background:#E7EBD8"><td>WP_011525036.1</td><td colspan = 5>hydrogenase 2 small subunit [Candidatus Koribacter versatilis]</td></tr>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Length: 401</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Number of predicted TMHs:  1</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Exp number of AAs in TMHs: 19.93057</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Exp number, first 60 AAs:  2.05251</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Total prob of N-in:        0.15168</td>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_011525036.1</td>
<td>WP_011525036.1</td>
<td>outside</td>
<td>1</td>
<td>344</td>
</tr>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_011525036.1</td>
<td>WP_011525036.1</td>
<td>TMhelix</td>
<td>345</td>
<td>367</td>
</tr>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_011525036.1</td>
<td>WP_011525036.1</td>
<td>inside</td>
<td>368</td>
<td>401</td>
</tr>
</body>
</html>

I've tried this Python script:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("file.html")
bsObj = BeautifulSoup(html, 'html.parser')
soup = BeautifulSoup(html, "html.parser")
table = soup.findAll("table")
rows = table.findAll("tr")

with open("editors.csv", "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        csv_row = []
        for cell in row.findAll(["td", "th"]):
            csv_row.append(cell.get_text())
        writer.writerow(csv_row)

And I got this error:

Traceback (most recent call last):
  File "CleanTableTEST.py", line 18, in <module>
    rows = table.findAll("tr")
  File "/home/raven/.local/lib/python3.6/site-packages/bs4/element.py", line 2128, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

I have also tried this code:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("file.html")
soup = BeautifulSoup(html, "html.parser")
table = soup.findAll("table", {"class":"wikitable"})[0]
rows = table.findAll("tr")

with open("editors.csv", "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        csv_row = []
        for cell in row.findAll(["td", "th"]):
            csv_row.append(cell.get_text())
        writer.writerow(csv_row)

And I got this error:

Traceback (most recent call last):
  File "CleanTable.py", line 17, in <module>
    table = soup.findAll("table", {"class":"wikitable"})[0]
IndexError: list index out of range

I have very little experience, so this is the result of a few days of searching and copy-editing code...

Many thanks

Answer 1

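Your first example is very close. soup.findAll("table") returns a ResultSet (a list of every matching table), and a ResultSet has no findAll method, which is what the error message is telling you. Since you are looking for one table rather than a list of tables, replace findAll with find for the table lookup. Also delete the unused bsObj line: it reads the response first, which leaves nothing for the second BeautifulSoup call to parse.

If you change the table line to the following, it should work as expected: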

table = soup.find("table")
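
For reference, here is a minimal sketch of the whole conversion with that fix applied. It is not a drop-in replacement for your script: SAMPLE_HTML is a hypothetical stand-in holding two rows from your sample, the helper names table_to_rows and rows_to_csv are made up for illustration, and the CSV is written to an in-memory buffer so the result is easy to inspect. It assumes every row is closed with </tr>; how rows that lack the closing tag are grouped depends on the parser and is not handled here. Your second attempt fails for a different reason: no table in the sample has the class "wikitable", so the filtered list is empty and [0] raises IndexError.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of file.html (two rows from the sample).
SAMPLE_HTML = """<html><body>
<table>
<tr><td>1</td><td>GCF_000688455.1_ASM68845v1_protein.faa.gz</td><td colspan = 4>Acidobacterium ailaaui</td></tr>
<tr>
<td>TMHMM</td>
<td>WP_026442391.1</td>
<td>WP_026442391.1</td>
<td>inside</td>
<td>1</td>
<td>201</td>
</tr>
</table>
</body></html>"""


def table_to_rows(html_text):
    """Return the rows of the first table as lists of cell strings."""
    soup = BeautifulSoup(html_text, "html.parser")
    # find() returns a single Tag (or None); find_all() returns a ResultSet.
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found")
    rows = []
    for row in table.find_all("tr"):
        cells = row.find_all(["td", "th"])
        # strip=True drops the whitespace around each cell's text.
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


def rows_to_csv(rows, out):
    """Write the rows to any text file-like object as CSV."""
    csv.writer(out).writerows(rows)


rows = table_to_rows(SAMPLE_HTML)
buffer = io.StringIO()
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```

To run this against a file on disk, read it with open("file.html", encoding="utf-8") and pass the text to table_to_rows, then pass a file opened with newline="" to rows_to_csv. urlopen expects a URL with a scheme, so a bare local path like "file.html" is better handled with open().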

solution1
0 ACCEPTED 2020-04-29 18:24:22
