How do I extract only one column of data from an HTML table with Python?

Question

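I am trying to pull some NBA statistics for a small project I am working on, and I only need a few columns (running vertically, top to bottom) from an HTML table, such as the one here. For now I only want PTS, so how should I extract just that column? I already know it is the third-from-last element of each data row, but I am not sure how to parse the data.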

Answer 1

I suggest reading the whole HTML table and then selecting the column you need. You may lose a little speed, but you gain a lot of simplicity.

This is easy to do with pandas' read_html function:

import io
import urllib.request  # Python 3 replacement for urllib2

import pandas as pd

url = 'http://www.basketball-reference.com/players/h/hardeja01/gamelog/2015/'
html = urllib.request.urlopen(url).read().decode('utf-8')

# Select the correct table by its attributes, in this case id=pgl_basic.
# read_html returns a list of DataFrames, one per matching table, so we
# take the first (and only) table with this id.
# Wrapping the markup in StringIO passes it as a file-like object.
stat_table = pd.read_html(io.StringIO(html), attrs={'id': 'pgl_basic'})[0]

# Select just the column we need.
point_column = stat_table['PTS']

print(point_column)
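
The question notes that PTS is the third-from-last cell of each data row. For comparison, that positional approach can be sketched with only the standard library's html.parser. This is a rough sketch, not a tested scraper for the real page: the ColumnParser class and the sample table (its layout and values) are made up for illustration.

```python
from html.parser import HTMLParser


class ColumnParser(HTMLParser):
    """Collect one cell, chosen by position, from every row that has <td> cells."""

    def __init__(self, index):
        super().__init__()
        self.index = index    # position of the wanted cell, e.g. -3
        self.row = None       # cells of the current row, or None outside <tr>
        self.in_cell = False  # True while inside a <td>
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag == "td" and self.row is not None:
            self.in_cell = True
            self.row.append("")

    def handle_data(self, data):
        if self.in_cell:
            self.row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.row is not None:
            try:
                self.values.append(self.row[self.index].strip())
            except IndexError:
                pass  # header rows (<th> only) or rows with too few cells
            self.row = None
            self.in_cell = False


# Hypothetical sample; the real table is assumed to have many more columns.
sample_html = """
<table id="pgl_basic">
  <tr><th>G</th><th>PTS</th><th>GmSc</th><th>+/-</th></tr>
  <tr><td>G1</td><td>32</td><td>22.4</td><td>+12</td></tr>
  <tr><td>G2</td><td>18</td><td>9.7</td><td>-3</td></tr>
</table>
"""

parser = ColumnParser(-3)
parser.feed(sample_html)
points = parser.values
print(points)
```

Selecting by position breaks as soon as the column order changes, which is one reason selecting the column by name with pandas is usually simpler.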

If you are not yet familiar with pandas, you can read more about it here: http://pandas-docs.github.io/pandas-docs-travis/10min.html

For example, you may want to remove the header rows from the table or split the table into several tables.
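
Removing header rows could look roughly like the sketch below. It assumes the header row repeats inside the table body, so the PTS column arrives as strings with a stray "PTS" entry. The small DataFrame is a hypothetical stand-in for the result of read_html, and clean_points is an illustrative helper, not part of pandas.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by read_html; the third
# row imitates a header row repeated in the middle of the table body.
stat_table = pd.DataFrame({
    "Rk": ["1", "2", "Rk", "3"],
    "PTS": ["21", "18", "PTS", "32"],
})


def clean_points(table, column="PTS"):
    """Return the column as integers, dropping every non-numeric row."""
    # errors="coerce" turns anything that is not a number into NaN.
    values = pd.to_numeric(table[column], errors="coerce")
    return values.dropna().astype(int).reset_index(drop=True)


points = clean_points(stat_table)
print(points.tolist())
```

Note that this also drops any other non-numeric entry in the column, so check whether such rows matter for your statistics before discarding them.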

Solution 1
1 2015-04-07 13:10:47
