Turning an HTML file into CSV using Python

After a few days of reading and searching across the internet, I've decided to ask here for help.

I have an HTML file that contains a table, and I need to turn this HTML file into CSV.

Small sample of my HTML file:

<html>
<body>
<p class="timestamp">Fri 21 Jul 13:14:15 BST 2017
</p>

<h3>TAT Signal and TMH near C-terminus</h3>
<table>
<tr style = "background:#E7EBD8"><td>1</td><td>GCF_000688455.1_ASM68845v1_protein.faa.gz</td><td colspan = 4>Acidobacterium ailaaui</td></tr>
<tr style = "background:#E7EBD8"><td>Taxonomy</td><td colspan = 5>Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium</td></tr>
<tr style = "background:#E7EBD8"><td>First 60 AAs</td><td colspan = 5>MSRRTFVSSATAGLAALGALSSAAEGHAQLVWTSKNWKLAEFETLLREPARIRQVYDVTQ</td></tr>
<tr style = "background:#E7EBD8"><td>WP_026442391.1</td><td colspan = 5>hypothetical protein [Acidobacterium ailaaui]</td></tr>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Length: 233</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Number of predicted TMHs:  1</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Exp number of AAs in TMHs: 21.25002</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Exp number, first 60 AAs:  1.35114</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_026442391.1</td><td colspan = 4>Total prob of N-in:        0.67991</td>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_026442391.1</td>
<td>WP_026442391.1</td>
<td>inside</td>
<td>1</td>
<td>201</td>
</tr>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_026442391.1</td>
<td>WP_026442391.1</td>
<td>TMhelix</td>
<td>202</td>
<td>224</td>
</tr>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_026442391.1</td>
<td>WP_026442391.1</td>
<td>outside</td>
<td>225</td>
<td>233</td>
</tr>
<tr style = "background:#D8EBEA"><td>2</td><td>GCF_000022565.1_ASM2256v1_protein.faa.gz</td><td colspan = 4>Acidobacterium capsulatum ATCC 51196</td></tr>
<tr style = "background:#D8EBEA"><td>Taxonomy</td><td colspan = 5>Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium capsulatum</td></tr>
<tr style = "background:#D8EBEA"><td>First 60 AAs</td><td colspan = 5>MKSISRRSFVTTAAAGMAALGSLGPALPAAQGQAVEMASDWDISSFNQLAQSPARVKQLF</td></tr>
<tr style = "background:#D8EBEA"><td>WP_012680923.1</td><td colspan = 5>Tat pathway signal sequence domain-containing protein [Acidobacterium capsulatum]</td></tr>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Length: 237</td>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Number of predicted TMHs:  1</td>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Exp number of AAs in TMHs: 31.62059</td>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Exp number, first 60 AAs:  5.92535</td>
<tr style = "background:#D8EBEA"><td>TMHMM</td><td>WP_012680923.1</td><td colspan = 4>Total prob of N-in:        0.86701</td>
<tr style = "background:#D8EBEA">
<td>TMHMM</td>
<td>WP_012680923.1</td>
<td>WP_012680923.1</td>
<td>inside</td>
<td>1</td>
<td>205</td>
</tr>
<tr style = "background:#D8EBEA">
<td>TMHMM</td>
<td>WP_012680923.1</td>
<td>WP_012680923.1</td>
<td>TMhelix</td>
<td>206</td>
<td>228</td>
</tr>
<tr style = "background:#D8EBEA">
<td>TMHMM</td>
<td>WP_012680923.1</td>
<td>WP_012680923.1</td>
<td>outside</td>
<td>229</td>
<td>237</td>
</tr>
<tr style = "background:#E7EBD8"><td>3</td><td>GCF_000014005.1_ASM1400v1_protein.faa.gz</td><td colspan = 4>Candidatus Koribacter versatilis Ellin345</td></tr>
<tr style = "background:#E7EBD8"><td>Taxonomy</td><td colspan = 5>Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Candidatus Koribacter; Candidatus Koribacter versatilis</td></tr>
<tr style = "background:#E7EBD8"><td>First 60 AAs</td><td colspan = 5>MGEKALMSKKPTIEEHLKATGVTRRSFVQLCGMLMAAAPIGLSLTSKASAQEVAKVVGKA</td></tr>
<tr style = "background:#E7EBD8"><td>WP_011525036.1</td><td colspan = 5>hydrogenase 2 small subunit [Candidatus Koribacter versatilis]</td></tr>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Length: 401</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Number of predicted TMHs:  1</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Exp number of AAs in TMHs: 19.93057</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Exp number, first 60 AAs:  2.05251</td>
<tr style = "background:#E7EBD8"><td>TMHMM</td><td>WP_011525036.1</td><td colspan = 4>Total prob of N-in:        0.15168</td>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_011525036.1</td>
<td>WP_011525036.1</td>
<td>outside</td>
<td>1</td>
<td>344</td>
</tr>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_011525036.1</td>
<td>WP_011525036.1</td>
<td>TMhelix</td>
<td>345</td>
<td>367</td>
</tr>
<tr style = "background:#E7EBD8">
<td>TMHMM</td>
<td>WP_011525036.1</td>
<td>WP_011525036.1</td>
<td>inside</td>
<td>368</td>
<td>401</td>
</tr>
</body>
</html>

I've tried this Python script:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("file.html")
bsObj = BeautifulSoup(html, 'html.parser')
soup = BeautifulSoup(html, "html.parser")
table = soup.findAll("table")
rows = table.findAll("tr")

with open("editors.csv", "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        csv_row = []
        for cell in row.findAll(["td", "th"]):
            csv_row.append(cell.get_text())
        writer.writerow(csv_row)

And I got this error:

Traceback (most recent call last):
  File "CleanTableTEST.py", line 18, in <module>
    rows = table.findAll("tr")
  File "/home/raven/.local/lib/python3.6/site-packages/bs4/element.py", line 2128, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

I have also tried this code:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("file.html")
soup = BeautifulSoup(html, "html.parser")
table = soup.findAll("table", {"class":"wikitable"})[0]
rows = table.findAll("tr")

with open("editors.csv", "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        csv_row = []
        for cell in row.findAll(["td", "th"]):
            csv_row.append(cell.get_text())
        writer.writerow(csv_row)

And I got this error:

Traceback (most recent call last):
  File "CleanTable.py", line 17, in <module>
    table = soup.findAll("table", {"class":"wikitable"})[0]
IndexError: list index out of range

I have very little experience, so this is the result of a few days of searching and of copying and editing code.

Many thanks

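Your first example is very close. soup.findAll("table") returns a ResultSet (a list of every matching table), and a ResultSet has no findAll method of its own, which is what the traceback is telling you. Since you are only looking for one table, replace findAll with find for the table, so that you get a single element back instead of a list.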

If you change the line to the following, it should work as expected:

table = soup.find("table")
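For reference, here is a minimal sketch of the whole conversion with that change applied. It is not a definitive implementation, and a few things in it are my own assumptions rather than anything from your script: the helper names (table_rows, write_csv, convert_file) are made up for illustration, the HTML is read from a local file with open() instead of urlopen(), and the markup is parsed only once (your first script parses html twice, and a response object can generally be read only once, so the second parse may see empty input). Your sample also omits some closing </tr> tags, so the sketch collects only each row's direct cells (recursive=False) in case html.parser nests the following rows inside the unclosed one. As a side note, your second attempt fails because the table in your sample has no class attribute, so filtering on {"class": "wikitable"} matches nothing and [0] raises the IndexError.

```python
import csv

from bs4 import BeautifulSoup


def table_rows(html_text):
    """Return every <tr> of the first <table> as a list of cell strings."""
    soup = BeautifulSoup(html_text, "html.parser")
    # find() returns a single Tag (or None); find_all() returns a ResultSet.
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found in the document")
    rows = []
    for tr in table.find_all("tr"):
        # recursive=False: take only this row's own cells, so cells of rows
        # that ended up nested inside an unclosed <tr> are not repeated.
        cells = tr.find_all(["td", "th"], recursive=False)
        # strip=True trims whitespace around each cell's text.
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


def write_csv(rows, file_obj):
    """Write the rows to an already opened text file object."""
    csv.writer(file_obj).writerows(rows)


def convert_file(html_path, csv_path):
    """Hypothetical convenience wrapper: local HTML file in, CSV file out."""
    with open(html_path, encoding="utf-8") as src:
        rows = table_rows(src.read())
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as dst:
        write_csv(rows, dst)
```

With the file names from your question, the call would be convert_file("file.html", "editors.csv"). Rows in your table have different numbers of cells (because of the colspan attributes), so the CSV lines will differ in length too.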
