pandas read_html(): ignoring superscripts and subscripts

Question

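I'm trying to build a DataFrame from the following page: https://www.ncbi.nlm.nih.gov/books/NBK56068/table/summarytables.t4/?report=objectonly

If you look at the column header for Water, the superscript "a" is a hyperlink, and the same goes for "b" on Protein, so my DataFrame column headers end up as "Watera" and "Proteinb".

I could go through and edit them one by one, but is there a way to ignore subscripts, superscripts, or hyperlinks programmatically?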

Answer 1

You can remove the <sup> tags with the help of BeautifulSoup before handing the HTML to pandas, for example:

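from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.ncbi.nlm.nih.gov/books/NBK56068/table/summarytables.t4/?report=objectonly'

# Download the page and parse it with the built-in HTML parser
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Remove every <sup> element (including the footnote links inside it)
for sup in soup.select('sup'):
    sup.extract()

# Wrap the cleaned HTML in StringIO: recent pandas versions deprecate
# passing a literal HTML string to read_html()
df = pd.read_html(StringIO(str(soup)))[0]
print(df)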

Prints:

   Life StageGroup Total Water(L/d)  ... α-Linolenic Acid(g/d) Protein(g/d)
0          Infants              NaN  ...                   NaN          NaN
1           0–6 mo             0.7*  ...                  0.5*         9.1*
2          6–12 mo             0.8*  ...                  0.5*         11.0
3         Children              NaN  ...                   NaN          NaN
4            1–3 y             1.3*  ...                  0.7*           13
5            4–8 y             1.7*  ...                  0.9*           19
6            Males              NaN  ...                   NaN          NaN
7           9–13 y             2.4*  ...                  1.2*           34
8          14–18 y             3.3*  ...                  1.6*           52
9          19–30 y             3.7*  ...                  1.6*           56
10         31–50 y             3.7*  ...                  1.6*           56
11         51–70 y             3.7*  ...                  1.6*           56
12          > 70 y             3.7*  ...                  1.6*           56
13         Females              NaN  ...                   NaN          NaN
14          9–13 y             2.1*  ...                  1.0*           34
15         14–18 y             2.3*  ...                  1.1*           46
16         19–30 y             2.7*  ...                  1.1*           46
17         31–50 y             2.7*  ...                  1.1*           46
18         51–70 y             2.7*  ...                  1.1*           46
19          > 70 y             2.7*  ...                  1.1*           46
20       Pregnancy              NaN  ...                   NaN          NaN
21         14–18 y             3.0*  ...                  1.4*           71
22         19–30 y             3.0*  ...                  1.4*           71
23         31–50 y             3.0*  ...                  1.4*           71
24       Lactation              NaN  ...                   NaN          NaN
25           14–18             3.8*  ...                  1.3*           71
26         19–30 y             3.8*  ...                  1.3*           71
27         31–50 y             3.8*  ...                  1.3*           71

[28 rows x 8 columns]
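
The question also mentions subscripts, so here is a minimal offline sketch of the same idea that strips both <sup> and <sub> before parsing. The inline HTML is a made-up miniature of the table, and the helper name read_table_without_markers and its tags parameter are hypothetical, chosen only for illustration:

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Made-up miniature of the table: footnote markers sit inside <sup> tags,
# one of them wrapping a hyperlink as on the real page.
HTML = """
<table>
  <thead>
    <tr>
      <th>Life Stage</th>
      <th>Water<sup><a href="#fn-a">a</a></sup></th>
      <th>Protein<sup>b</sup></th>
    </tr>
  </thead>
  <tbody>
    <tr><td>1-3 y</td><td>1.3</td><td>13</td></tr>
    <tr><td>4-8 y</td><td>1.7</td><td>19</td></tr>
  </tbody>
</table>
"""


def read_table_without_markers(html, tags=("sup", "sub")):
    """Hypothetical helper: drop the given tags, then parse the first table."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() accepts a list of tag names; decompose() removes each
    # matching element together with everything nested inside it.
    for tag in soup.find_all(list(tags)):
        tag.decompose()
    return pd.read_html(StringIO(str(soup)))[0]


df = read_table_without_markers(HTML)
print(df.columns.tolist())
```

Note that dropping <sub> also removes legitimate subscripts (for example the "2" in a chemical formula), so pass tags=("sup",) if the table contains those.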

Solution 1
2 Accepted 2020-10-11 23:17:26
