
How to extract data from an HTML table with Beautiful Soup

How can I extract specific data from the following table, such as the decay time 91.1 ms 5?

<table bgcolor=navy cellpadding=4 cellspacing=1 border=0 align=center> 
  <tr class=hp >
    <td nowrap>E(level) (MeV)</td>
    <td nowrap>J&pi;</td><td nowrap>&Delta;(MeV)</td>
    <td nowrap>T<sub>1/2</sub></td>
    <td nowrap>Decay Modes</td>
  </tr>
  <tr class=cp>
    <td nowrap valign=top>0.0</td>
    <td nowrap valign=top>4+</td>
    <td nowrap valign=top> 18.2010</td>
    <td nowrap valign=top>91.1 ms <i>5</i>&nbsp;</td>
    <td nowrap valign=top> &epsilon; : 100.00 &#37;<br>  &epsilon;p : 55.00 &#37;<br>  &epsilon;2p : 1.10 &#37;<br>  &epsilon;&alpha; : 0.04 &#37;<br> </td>
  </tr>
</table>

Here's a simple way to load that table into a pandas DataFrame:

from bs4 import BeautifulSoup
import pandas as pd

page = """<table cellpadding=4 cellspacing=1 border=0 align=center> 
  <tr class=hp >
    <td nowrap>E(level) (MeV)</td>
    <td nowrap>J&pi;</td>
    <td nowrap>&Delta;(MeV)</td>
    <td nowrap>T<sub>1/2</sub></td>
    <td nowrap>Decay Modes</td>
  </tr>
  <tr class=cp>
    <td nowrap valign=top>0.0</td>
    <td nowrap valign=top>4+</td>
    <td nowrap valign=top> 18.2010</td>
    <td nowrap valign=top>91.1 ms <i>5</i>&nbsp;</td>
    <td nowrap valign=top> &epsilon; : 100.00 &#37;<br>  &epsilon;p : 55.00 &#37;<br>  &epsilon;2p : 1.10 &#37;<br>  &epsilon;&alpha; : 0.04 &#37;<br> </td>
  </tr>
</table>"""

soup = BeautifulSoup(page, "html.parser")

# Column names come from the header row (class "hp")
headers = soup.find('tr', {'class': 'hp'}).find_all('td')
columns = [header.text for header in headers]

# Each data row (class "cp") becomes one list of cell texts
data = []
for row in soup.find_all('tr', {'class': 'cp'}):
    data.append([element.text for element in row.find_all('td')])

df = pd.DataFrame(data, columns=columns)

# header.text flattens T<sub>1/2</sub> to the plain string 'T1/2'
print(df['T1/2'])

Output:

0    91.1 ms 5 
Name: T1/2, dtype: object

If the Decay Modes cell holds several entries (they are separated by <br>), you may need extra code to split them. Alternatively, if you control the HTML, put each entry in its own row tag and the header in a header tag.
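One way to split the <br>-separated entries is sketched below. It assumes the cell markup is exactly as in the question; the names cell_html and decay_modes are made up for this illustration, and the cell is parsed on its own only to keep the example short.

```python
from bs4 import BeautifulSoup

# Hypothetical standalone copy of the "Decay Modes" cell from the question's table
cell_html = (
    "<td nowrap valign=top> &epsilon; : 100.00 &#37;<br>  &epsilon;p : 55.00 &#37;<br>"
    "  &epsilon;2p : 1.10 &#37;<br>  &epsilon;&alpha; : 0.04 &#37;<br> </td>"
)

cell = BeautifulSoup(cell_html, "html.parser").find("td")

# stripped_strings yields each text fragment between the <br> tags,
# with surrounding whitespace removed and whitespace-only fragments skipped
decay_modes = {}
for fragment in cell.stripped_strings:
    mode, _, fraction = fragment.partition(":")
    decay_modes[mode.strip()] = fraction.strip()

for mode, fraction in decay_modes.items():
    print(mode, "->", fraction)
```

Each decay mode ends up as a dictionary key with its branching fraction as the value, so the entries can be stored in separate DataFrame rows or columns if needed.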

You can use get_element_by_tag_name to get the table, then iterate over each inner tag and pull out the data you need.

Assuming you already have the markup in a string: find the rows by class (cp), then find the cells by tag (td), and read the value of each cell through its .text attribute. For example:

import re
from bs4 import BeautifulSoup

html_doc = """<table bgcolor=navy cellpadding=4 cellspacing=1 border=0 align=center> 
  <tr class=hp >
    <td nowrap>E(level) (MeV)</td>
    <td nowrap>J&pi;</td><td nowrap>&Delta;(MeV)</td>
    <td nowrap>T<sub>1/2</sub></td>
    <td nowrap>Decay Modes</td>
  </tr>
  <tr class=cp>
    <td nowrap valign=top>0.0</td>
    <td nowrap valign=top>4+</td>
    <td nowrap valign=top> 18.2010</td>
    <td nowrap valign=top>91.1 ms <i>5</i>&nbsp;</td>
    <td nowrap valign=top> &epsilon; : 100.00 &#37;<br>  &epsilon;p : 55.00 &#37;<br>  &epsilon;2p : 1.10 &#37;<br>  &epsilon;&alpha; : 0.04 &#37;<br> </td>
  </tr>
</table>"""

soup = BeautifulSoup(html_doc, 'html.parser')
elements = soup.find_all(class_=re.compile("cp"))

for e in elements[0].find_all('td'):
    # e.text holds the text of each td element in the row
    print(e.text)
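To pick out only the half-life instead of printing every cell, the column position can be looked up from the header row first. This is a rough sketch under the assumption that the header row (class hp) and the data rows (class cp) have the same number of cells; the helper cell_text and the trimmed table are introduced only for this example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's table (layout attributes dropped for brevity)
html_doc = """<table>
  <tr class=hp>
    <td>E(level) (MeV)</td><td>J&pi;</td><td>&Delta;(MeV)</td>
    <td>T<sub>1/2</sub></td><td>Decay Modes</td>
  </tr>
  <tr class=cp>
    <td>0.0</td><td>4+</td><td> 18.2010</td>
    <td>91.1 ms <i>5</i>&nbsp;</td><td> &epsilon; : 100.00 &#37;<br> </td>
  </tr>
</table>"""

soup = BeautifulSoup(html_doc, "html.parser")


def cell_text(cell):
    """Hypothetical helper: full cell text with outer whitespace (including &nbsp;) removed."""
    return cell.get_text().strip()


# Find which column holds the half-life by its header text
headers = [cell_text(td) for td in soup.find("tr", class_="hp").find_all("td")]
half_life_index = headers.index("T1/2")

# Read that column from every data row
half_lives = []
for row in soup.find_all("tr", class_="cp"):
    cells = row.find_all("td")
    half_lives.append(cell_text(cells[half_life_index]))

print(half_lives)
```

Calling .get_text() and then .strip() keeps the space between "ms" and "5"; get_text(strip=True) would strip each fragment separately and join them without it.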

Usually, when I see a <table> tag, pandas .read_html() is the first thing I try. It returns a list of DataFrames. After that, it's just a matter of selecting the DataFrame you want and manipulating it into shape, or pulling out the data you need:

import pandas as pd


html = '''<table bgcolor=navy cellpadding=4 cellspacing=1 border=0 align=center> 
  <tr class=hp >
    <td nowrap>E(level) (MeV)</td>
    <td nowrap>J&pi;</td><td nowrap>&Delta;(MeV)</td>
    <td nowrap>T<sub>1/2</sub></td>
    <td nowrap>Decay Modes</td>
  </tr>
  <tr class=cp>
    <td nowrap valign=top>0.0</td>
    <td nowrap valign=top>4+</td>
    <td nowrap valign=top> 18.2010</td>
    <td nowrap valign=top>91.1 ms <i>5</i>&nbsp;</td>
    <td nowrap valign=top> &epsilon; : 100.00 &#37;<br>  &epsilon;p : 55.00 &#37;<br>  &epsilon;2p : 1.10 &#37;<br>  &epsilon;&alpha; : 0.04 &#37;<br> </td>
  </tr>
</table>'''

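# Newer pandas versions warn when given a literal HTML string;
# if that happens, pass io.StringIO(html) instead
tables = pd.read_html(html)
df = tables[0]
# The first row holds the column names, so promote it to the header
df.columns = df.iloc[0,:]
df = df.iloc[1:,:]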

Output:

print(df.loc[1,'T1/2'])
91.1 ms 5
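Once the cell text is extracted, it may be useful to split it into its parts. The sketch below makes two assumptions: that the text always has the shape "number unit optional-integer", and that the trailing italic number is some kind of annotation such as an uncertainty, which the question does not actually state. The names HALF_LIFE_PATTERN and parse_half_life are made up for this example.

```python
import re

# Assumed shape: "<number> <unit> [<trailing integer>]", e.g. "91.1 ms 5"
HALF_LIFE_PATTERN = re.compile(
    r"^\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)\s*(?P<trailing>\d+)?\s*$"
)


def parse_half_life(text):
    """Split a half-life string into value, unit and the trailing number (if any)."""
    match = HALF_LIFE_PATTERN.match(text)
    if match is None:
        return None  # text does not follow the assumed shape
    return {
        "value": float(match.group("value")),
        "unit": match.group("unit"),
        "trailing": match.group("trailing"),
    }


# \s also matches the non-breaking space left over from &nbsp;
parsed = parse_half_life("91.1 ms 5\xa0")
print(parsed)
```

Strings that do not match the assumed shape return None rather than raising, so unexpected cells can be inspected by hand.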
