Get a list of strings from a pandas DataFrame's column index

Question

First off, here is what my .xlsx timeseries data looks like:

[Image: what the data looks like in Excel]

and here is how I'm reading it:

import time

import numpy as np
import pandas as pd

def loaddata(filepaths):
    t1 = time.clock()
    for i in range(len(filepaths)):
        xl = pd.ExcelFile(filepaths[i])
        # Row 1 holds the column labels; the third column holds the timestamps
        df = xl.parse(xl.sheet_names[0], header=0, index_col=2, skiprows=[0,2,3,4], parse_dates=True)
        # Drop columns that contain no data at all
        df = df.dropna(axis=1, how='all')
        df = df.drop(['Decimal Year Day', 'Decimal Year Day.1', 'RECORD'], axis=1)
        # Round the timestamps to the nearest minute
        df.index = pd.DatetimeIndex(((df.index.asi8/(1e9*60)).round()*1e9*60).astype(np.int64)).values

        if i == 0:
            dfs = df
        else:
            dfs = pd.concat([dfs, df], axis=1)

    t2 = time.clock()
    print "Files loaded into dataframe in %s seconds" %(t2-t1)

    return dfs

files = ["London Lysimeters corrected 5min.xlsx"]
data = loaddata(files)

What I need to be able to do is read the column labels AND units (rows 2 and 3) as well as the values into a pandas DataFrame, and be able to access the labels row and the units row as lists of strings. I can't figure out how to load both rows 2 and 3 and still have the time read correctly into a pandas DatetimeIndex, but it works fine if I only load the labels. I've also looked everywhere and can't figure out how to get the column headers as a list.

I would appreciate it if anyone could help with either of these issues.

Answer 1

First of all, get rid of that for i in range(len(filepaths))! The Pythonic way is for i, filepath in enumerate(filepaths). enumerate yields (index, item) tuples, so you can say ExcelFile(filepath) instead of ExcelFile(filepaths[i]).
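As a small illustration of the difference, here is a sketch that runs both loop styles over a list of made-up file names (the names are placeholders, not files from the question) and collects the same (index, path) pairs either way:

```python
# Hypothetical file names, used only for illustration.
filepaths = ["a.xlsx", "b.xlsx"]

# Index-based loop from the question: each path is looked up by position.
by_index = []
for i in range(len(filepaths)):
    by_index.append((i, filepaths[i]))

# enumerate yields (index, item) pairs, so no lookup is needed.
by_enumerate = []
for i, filepath in enumerate(filepaths):
    by_enumerate.append((i, filepath))

print(by_enumerate)
```

Both loops produce the same pairs; the second simply avoids the repeated filepaths[i] lookup.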

I think your two problems are related. If I'm reading your code correctly, when you include rows 2 and 3 the dates can't be parsed because the timestamp column isn't homogeneous: it no longer contains only timestamps.
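To make the point concrete, here is a plain-Python sketch (not pandas' actual parser) of a column whose first two cells are header text. The cell values and the try_parse helper are invented for illustration; the idea is only that the header cells cannot be read as dates while the data cells can.

```python
from datetime import datetime

# Made-up column: a label cell and a unit cell, followed by real timestamps.
column = ["Timestamp", "ts", "2012-08-29 07:10:00", "2012-08-29 09:10:00"]

def try_parse(text):
    """Return a datetime, or None when the cell is not a timestamp."""
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None

parsed = [try_parse(cell) for cell in column]
print(parsed)
```

The first two entries come back as None, which is why the label and unit rows have to be kept out of the column that becomes the datetime index.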

You could use a hierarchical index to get the data in (column, label, unit) format. It's probably easiest to first read in just the header information, then read the data separately and set the columns after the fact (I don't have Excel handy right now, but I think all the read_csv options I use are also available via xlrd):

In [7]: df_header = pd.read_csv('test.csv', nrows=2, index_col='three')

In [8]: df_header
Out[8]: 
               one      two    four
three                              
Timestamp  Decimal  Decimal  record
ts             ref      ref      rn

In [9]: df_data = pd.read_csv('test.csv', names=df_header.columns,
   ...:                       skiprows=4, parse_dates=True, index_col=2)

In [10]: df_data
Out[10]: 
                      one   two  four
2012-08-29 07:10:00  32.1  32.0   232
2012-08-29 09:10:00   1.1   1.2   233

In [11]: cols = pd.MultiIndex.from_tuples([tuple([x] + df_header[x].tolist())
   ....:                                   for x in df_header])

In [12]: cols
Out[12]: 
MultiIndex
[one   Decimal  ref, two   Decimal  ref, four  record   rn ]

In [14]: df_data.columns = cols

In [15]: df_data
Out[15]: 
                         one      two    four
                     Decimal  Decimal  record
                         ref      ref      rn
2012-08-29 07:10:00     32.1     32.0     232
2012-08-29 09:10:00      1.1      1.2     233

This should get you to the point in your code where you start dropping columns and concatenating. Also take a look at the developer docs. It looks like the syntax for reading Excel files is getting cleaned up (much nicer!). You might be able to use the parse_cols argument with a list of ints to avoid dropping columns later.
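For anyone who wants to try the two-pass idea without a file on disk, here is a self-contained sketch. The CSV text is a made-up stand-in for test.csv (its exact layout is an assumption, with three header rows rather than four), and it is read from an in-memory string with io.StringIO:

```python
import io

import pandas as pd

# Made-up stand-in for test.csv: names, labels, units, then data rows.
csv_text = """one,two,three,four
Decimal,Decimal,Timestamp,record
ref,ref,ts,rn
32.1,32.0,2012-08-29 07:10:00,232
1.1,1.2,2012-08-29 09:10:00,233
"""

# Pass 1: read only the label and unit rows, keyed by the timestamp column.
df_header = pd.read_csv(io.StringIO(csv_text), nrows=2, index_col="three")

# Pass 2: skip all three header rows and parse the timestamp column as the index.
df_data = pd.read_csv(
    io.StringIO(csv_text),
    names=["one", "two", "three", "four"],
    skiprows=3,
    index_col="three",
    parse_dates=True,
)

# One tuple per data column: (name, label, unit).
cols = pd.MultiIndex.from_tuples(
    [(name, *df_header[name].tolist()) for name in df_header.columns]
)
df_data.columns = cols

print(df_data)
```

Because the label and unit rows never reach the second read, the timestamp column contains only dates and can be parsed into a DatetimeIndex.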

Oh, and you can get the list of strings with df_data.columns.tolist().
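One caveat worth sketching: with hierarchical columns, tolist() returns one tuple per column rather than plain strings. Assuming a column index shaped like the (name, label, unit) layout above, rebuilt by hand here with the same sample values, get_level_values can pull out each row of the header as its own list of strings:

```python
import pandas as pd

# Column index built by hand to mirror the (name, label, unit) layout.
cols = pd.MultiIndex.from_tuples(
    [("one", "Decimal", "ref"), ("two", "Decimal", "ref"), ("four", "record", "rn")]
)
df = pd.DataFrame([[32.1, 32.0, 232], [1.1, 1.2, 233]], columns=cols)

# tolist() on a MultiIndex gives a list of tuples, one per column.
as_tuples = df.columns.tolist()

# Each level can be extracted separately as a list of strings.
names = df.columns.get_level_values(0).tolist()
labels = df.columns.get_level_values(1).tolist()
units = df.columns.get_level_values(2).tolist()

print(as_tuples)
print(labels, units)
```

On a DataFrame with ordinary single-level columns, columns.tolist() already returns a flat list of strings.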

Answer 1: 1 accepted, 2013-07-23 12:30:16
