I need to parse a table from HTML that has other tables nested within the larger table. When called as below, pd.read_html parses each of these nested tables and then "inserts"/"concatenates" them as rows.
I'd like each of these nested tables to be parsed into its own pd.DataFrame and then inserted as an object as the value of the corresponding column.
If this is not possible, having the raw HTML for the nested table as a string in the corresponding position would be fine.
Code as tested:
import pandas as pd
df_up = pd.read_html("up_pf00344.test.html", attrs = {'id': 'results'})
Screenshot of table as rendered in html:
Link to file: https://gist.github.com/smsaladi/6adb30efbe70f9fed0306b226e8ad0d8#file-up_pf00344-test-html-L62
You can't use read_html to read nested tables, but you can roll your own HTML reader and use read_html for the table cells:
import io

import bs4
import pandas as pd

with open('up_pf00344.test.html') as f:
    html = f.read()

soup = bs4.BeautifulSoup(html, 'lxml')
results = soup.find(attrs={'id': 'results'})

# get first visible header row as dataframe headers
for row in results.thead.find_all('tr'):
    if 'display:none' not in row.get('style', ''):
        df = pd.DataFrame(columns=[col.get_text() for col in row.find_all('th')])
        break

# append all table rows to dataframe
for row in results.tbody.find_all('tr', recursive=False):
    if 'display:none' in row.get('style', ''):
        continue
    df_row = []
    for col in row.find_all('td', recursive=False):
        # cells containing a nested table become DataFrames; all others stay plain text
        table = col.find_all('table')
        # wrapped in StringIO because recent pandas versions deprecate literal HTML strings
        df_row.append(pd.read_html(io.StringIO(str(col)))[0] if table else col.get_text())
    df.loc[len(df)] = df_row
Result of df.iloc[0].map(type):
<class 'str'>
Entry <class 'str'>
Organism <class 'str'>
Protein names <class 'str'>
Gene names <class 'str'>
Length <class 'str'>
Cross-reference (Pfam) <class 'str'>
Cross-reference (InterPro) <class 'str'>
Taxonomic lineage IDs <class 'str'>
Subcellular location [CC] <class 'str'>
Signal peptide <class 'str'>
Transit peptide <class 'str'>
Topological domain <class 'pandas.core.frame.DataFrame'>
Transmembrane <class 'pandas.core.frame.DataFrame'>
Intramembrane <class 'pandas.core.frame.DataFrame'>
Sequence caution <class 'str'>
Caution <class 'str'>
Taxonomic lineage (SUPERKINGDOM) <class 'str'>
Taxonomic lineage (KINGDOM) <class 'str'>
Taxonomic lineage (PHYLUM) <class 'str'>
Cross-reference (RefSeq) <class 'str'>
Cross-reference (EMBL) <class 'str'>
e <class 'str'>
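The fallback mentioned in the question, keeping the raw HTML of each nested table as a string instead of parsing it, fits the same cell-by-cell loop. Here is a minimal sketch, assuming a small made-up table: SAMPLE_HTML, its ids and its values are invented for illustration, and cell_value is a hypothetical helper. It uses Python's built-in 'html.parser' instead of 'lxml' only to keep the example light on dependencies.

```python
import bs4
import pandas as pd

# Invented sample data: an outer table whose second column may hold a nested table.
SAMPLE_HTML = """
<table id="results">
  <thead><tr><th>Entry</th><th>Transmembrane</th></tr></thead>
  <tbody>
    <tr id="row_a"><td>A0001</td><td><table><tr><td>10</td><td>30</td></tr></table></td></tr>
    <tr id="row_b"><td>A0002</td><td></td></tr>
  </tbody>
</table>
"""


def cell_value(td):
    """Hypothetical helper: raw HTML of a nested table if present, else the cell text."""
    nested = td.find("table")
    return str(nested) if nested is not None else td.get_text(strip=True)


soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
results = soup.find(attrs={"id": "results"})

headers = [th.get_text(strip=True) for th in results.thead.find_all("th")]

# recursive=False keeps the rows and cells of nested tables out of the outer loop
rows = [
    [cell_value(td) for td in tr.find_all("td", recursive=False)]
    for tr in results.tbody.find_all("tr", recursive=False)
]

df_raw = pd.DataFrame(rows, columns=headers)
```

Each stored string can be parsed later, for example with read_html, when a DataFrame is needed for that cell.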
Bonus: As your table rows have an id, you could use it as the index of your dataframe with df.loc[row.get('id')] = df_row instead of df.loc[len(df)] = df_row.
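As a rough illustration of indexing by row id, here is a sketch on an invented two-row table (SAMPLE_HTML and the ids row_a and row_b are assumptions, not taken from the linked file). It uses 'html.parser' only to avoid extra dependencies.

```python
import bs4
import pandas as pd

# Invented sample data: every outer row carries an id attribute.
SAMPLE_HTML = """
<table id="results">
  <thead><tr><th>Entry</th><th>Length</th></tr></thead>
  <tbody>
    <tr id="row_a"><td>A0001</td><td>250</td></tr>
    <tr id="row_b"><td>A0002</td><td>310</td></tr>
  </tbody>
</table>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
results = soup.find(attrs={"id": "results"})

df_by_id = pd.DataFrame(columns=[th.get_text() for th in results.thead.find_all("th")])

for row in results.tbody.find_all("tr", recursive=False):
    df_row = [td.get_text() for td in row.find_all("td", recursive=False)]
    # label the new row with the <tr> id instead of a running integer
    df_by_id.loc[row.get("id")] = df_row
```

This assumes every row has a unique id. A row without one would yield None from row.get('id'), so such rows would need a fallback label.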