Extract just one column of data from HTML table w/ Python?

I'm trying to extract some NBA stats for a small project, and I need the data from just a couple of columns (read vertically, top to bottom) of an HTML table, like this one here. For now I only want PTS, so how should I pull out just that one column? I've worked out that it is the third-to-last element of each data row, but I'm not sure how to go about parsing the data.

I would suggest reading the whole HTML table and then selecting just the column you need. You may lose a little speed, but you gain more in simplicity.

That is easy to do with pandas' read_html function:

import io
from urllib.request import urlopen

import pandas as pd

# Download the page (Python 3; urllib2 only exists in Python 2).
url = 'http://www.basketball-reference.com/players/h/hardeja01/gamelog/2015/'
html = urlopen(url).read().decode('utf-8')

# Select the correct table by its attributes, in this case id=pgl_basic.
# read_html returns a list of tables, so take the first (and only) table
# with this id. Recent pandas versions expect a file-like object rather
# than a raw HTML string, hence the StringIO wrapper.
stat_table = pd.read_html(io.StringIO(html), attrs={'id': 'pgl_basic'})[0]

# Select just the column we need.
point_column = stat_table['PTS']

print(point_column)

If you are not familiar with pandas yet, you can read more here: http://pandas-docs.github.io/pandas-docs-travis/10min.html

For example, you might want to remove the header rows from the table or split the table into multiple tables.
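As a rough sketch of the header-row cleanup mentioned above: the snippet below assumes that any repeated header rows show up in the parsed table as ordinary rows whose PTS cell holds the text "PTS" instead of a number. That is an assumption about the page, not something confirmed here. The small DataFrame is a made-up stand-in for the real game log, and numeric_column is a hypothetical helper name.

```python
import pandas as pd

# Made-up stand-in for the parsed table (NOT real game data); the third row
# imitates a header row repeated inside the table body.
stat_table = pd.DataFrame({
    "G": ["1", "2", "G", "3"],
    "PTS": ["32", "18", "PTS", "25"],
})


def numeric_column(table, name):
    """Return one column as integers, dropping rows that are not numeric."""
    # errors="coerce" turns non-numeric cells (such as "PTS") into NaN.
    values = pd.to_numeric(table[name], errors="coerce")
    return values.dropna().astype(int).reset_index(drop=True)


points = numeric_column(stat_table, "PTS")
print(points.tolist())
```

Note that coercing to numbers also drops any other non-numeric rows, so check that this is what you want before relying on it.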
