pandas read_html() ignore superscripts and subscripts
You can remove the <sup> tags with the help of BeautifulSoup before passing the HTML to pandas, for example:
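from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.ncbi.nlm.nih.gov/books/NBK56068/table/summarytables.t4/?report=objectonly'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# remove every <sup> element (and the text inside it) from the parsed page
for sup in soup.select('sup'):
    sup.extract()

# wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(str(soup)))[0]
print(df)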
Prints:
   Life StageGroup Total Water(L/d)  ... α-Linolenic Acid(g/d) Protein(g/d)
0          Infants              NaN  ...                   NaN          NaN
1           0–6 mo             0.7*  ...                  0.5*         9.1*
2          6–12 mo             0.8*  ...                  0.5*         11.0
3         Children              NaN  ...                   NaN          NaN
4            1–3 y             1.3*  ...                  0.7*           13
5            4–8 y             1.7*  ...                  0.9*           19
6            Males              NaN  ...                   NaN          NaN
7           9–13 y             2.4*  ...                  1.2*           34
8          14–18 y             3.3*  ...                  1.6*           52
9          19–30 y             3.7*  ...                  1.6*           56
10         31–50 y             3.7*  ...                  1.6*           56
11         51–70 y             3.7*  ...                  1.6*           56
12          > 70 y             3.7*  ...                  1.6*           56
13         Females              NaN  ...                   NaN          NaN
14          9–13 y             2.1*  ...                  1.0*           34
15         14–18 y             2.3*  ...                  1.1*           46
16         19–30 y             2.7*  ...                  1.1*           46
17         31–50 y             2.7*  ...                  1.1*           46
18         51–70 y             2.7*  ...                  1.1*           46
19          > 70 y             2.7*  ...                  1.1*           46
20       Pregnancy              NaN  ...                   NaN          NaN
21         14–18 y             3.0*  ...                  1.4*           71
22         19–30 y             3.0*  ...                  1.4*           71
23         31–50 y             3.0*  ...                  1.4*           71
24       Lactation              NaN  ...                   NaN          NaN
25           14–18             3.8*  ...                  1.3*           71
26         19–30 y             3.8*  ...                  1.3*           71
27         31–50 y             3.8*  ...                  1.3*           71

[28 rows x 8 columns]
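
Since the question also mentions subscripts, here is a minimal offline sketch of the same idea extended to <sub> tags. The sample markup and the helper name strip_sup_sub are made up for illustration and are not taken from the NCBI page; the sketch only assumes that BeautifulSoup is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption for this sketch, not the NCBI table).
SAMPLE_HTML = """
<table>
  <tr><th>Nutrient</th><th>Amount</th></tr>
  <tr><td>H<sub>2</sub>O</td><td>0.7<sup>a</sup></td></tr>
  <tr><td>Protein</td><td>9.1<sup>b</sup></td></tr>
</table>
"""


def strip_sup_sub(html: str) -> str:
    """Return the HTML with every <sup> and <sub> element, and its text, removed."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names, so both kinds are handled in one pass
    for tag in soup.find_all(["sup", "sub"]):
        tag.decompose()
    return str(soup)


cleaned = strip_sup_sub(SAMPLE_HTML)
print(cleaned)
```

The cleaned string can then be handed to pandas in the same way as above, for instance pd.read_html(StringIO(cleaned))[0]. Note that removing a tag also drops its text, so H<sub>2</sub>O becomes HO; if the digits should be kept inline, calling tag.unwrap() instead of tag.decompose() keeps the text and removes only the markup.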