pandas read_html() ignore superscripts and subscripts

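I am trying to build a DataFrame from the following page: https://www.ncbi.nlm.nih.gov/books/NBK56068/table/summarytables.t4/?report=objectonly

If you look at the column header for Water, it carries a superscript "a" that is a hyperlink, and the Protein header carries a "b" in the same way, so my DataFrame column headers end up as "Watera" and "Proteinb".

I could go through and edit them one by one, but is there a way to programmatically ignore subscripts and superscripts, or hyperlinks?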

You can remove the <sup> tags with the help of BeautifulSoup before handing the HTML to pandas, for example:

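from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.ncbi.nlm.nih.gov/books/NBK56068/table/summarytables.t4/?report=objectonly'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# remove every <sup> tag, including the footnote link nested inside it
for sup in soup.select('sup'):
    sup.extract()

# wrap the markup in StringIO: newer pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(str(soup)))[0]
print(df)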

Prints:

   Life StageGroup Total Water(L/d)  ... α-Linolenic Acid(g/d) Protein(g/d)
0          Infants              NaN  ...                   NaN          NaN
1           0–6 mo             0.7*  ...                  0.5*         9.1*
2          6–12 mo             0.8*  ...                  0.5*         11.0
3         Children              NaN  ...                   NaN          NaN
4            1–3 y             1.3*  ...                  0.7*           13
5            4–8 y             1.7*  ...                  0.9*           19
6            Males              NaN  ...                   NaN          NaN
7           9–13 y             2.4*  ...                  1.2*           34
8          14–18 y             3.3*  ...                  1.6*           52
9          19–30 y             3.7*  ...                  1.6*           56
10         31–50 y             3.7*  ...                  1.6*           56
11         51–70 y             3.7*  ...                  1.6*           56
12          > 70 y             3.7*  ...                  1.6*           56
13         Females              NaN  ...                   NaN          NaN
14          9–13 y             2.1*  ...                  1.0*           34
15         14–18 y             2.3*  ...                  1.1*           46
16         19–30 y             2.7*  ...                  1.1*           46
17         31–50 y             2.7*  ...                  1.1*           46
18         51–70 y             2.7*  ...                  1.1*           46
19          > 70 y             2.7*  ...                  1.1*           46
20       Pregnancy              NaN  ...                   NaN          NaN
21         14–18 y             3.0*  ...                  1.4*           71
22         19–30 y             3.0*  ...                  1.4*           71
23         31–50 y             3.0*  ...                  1.4*           71
24       Lactation              NaN  ...                   NaN          NaN
25           14–18             3.8*  ...                  1.3*           71
26         19–30 y             3.8*  ...                  1.3*           71
27         31–50 y             3.8*  ...                  1.3*           71

[28 rows x 8 columns]
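
The same idea can be tried offline and extended to subscripts. The following is only a minimal sketch: `SAMPLE_HTML` is a hypothetical stand-in for the real page with placeholder values, and `read_html_without_markers` and its `tags` parameter are names made up for this illustration, not part of pandas or BeautifulSoup.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real table; the cell values are placeholders.
SAMPLE_HTML = """
<table>
  <thead>
    <tr>
      <th>Life Stage</th>
      <th>Water<sup><a href="#fn-a">a</a></sup> (L/d)</th>
      <th>Protein<sup>b</sup> (g/d)</th>
      <th>Fiber<sub>c</sub> (g/d)</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>1-3 y</td><td>1.3</td><td>13</td><td>10</td></tr>
    <tr><td>4-8 y</td><td>1.7</td><td>19</td><td>20</td></tr>
  </tbody>
</table>
"""


def read_html_without_markers(html, tags=("sup", "sub")):
    """Parse the tables in `html` after removing the given tags and their contents."""
    soup = BeautifulSoup(html, "html.parser")
    # select() accepts a CSS selector list, e.g. "sup, sub"
    for tag in soup.select(", ".join(tags)):
        tag.extract()  # removes the tag together with any nested <a> link
    # StringIO avoids the literal-HTML deprecation warning in newer pandas
    return pd.read_html(StringIO(str(soup)))


df = read_html_without_markers(SAMPLE_HTML)[0]
print(list(df.columns))
```

With the markers stripped, the headers no longer pick up the trailing "a", "b" or "c". Note that removing every `<sub>` also deletes meaningful subscripts (a chemical formula such as H<sub>2</sub>O would lose its "2"), so pass `tags=("sup",)` if only the footnote markers should go.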

