
Obtain list of strings from column index of pandas dataframe

First off, here is what my .xlsx timeseries data looks like:

[Image: what the data looks like in Excel]

and here is how I'm reading it:

import time

import numpy as np
import pandas as pd

def loaddata(filepaths):
    t1 = time.perf_counter()  # time.clock() was removed in Python 3.8
    for i in range(len(filepaths)):
        xl = pd.ExcelFile(filepaths[i])
        # Use one header row and the third column as a datetime index
        df = xl.parse(xl.sheet_names[0], header=0, index_col=2, skiprows=[0,2,3,4], parse_dates=True)
        df = df.dropna(axis=1, how='all')  # drop columns that are entirely empty
        df = df.drop(['Decimal Year Day', 'Decimal Year Day.1', 'RECORD'], axis=1)
        # Round the timestamps to the nearest minute
        df.index = pd.DatetimeIndex(((df.index.asi8/(1e9*60)).round()*1e9*60).astype(np.int64)).values

        if i == 0:
            dfs = df
        else:
            dfs = pd.concat([dfs, df], axis=1)

    t2 = time.perf_counter()
    print("Files loaded into dataframe in %s seconds" % (t2 - t1))

    return dfs

files = ["London Lysimeters corrected 5min.xlsx"]
data = loaddata(files)

What I need to do is read the column labels and units (rows 2 and 3) as well as the values into a pandas dataframe, and be able to access the labels row and the units row as lists of strings. I can't figure out how to load both rows 2 and 3 and still have the time read correctly into a pandas DatetimeIndex, though it works fine if I load only the labels. I've also looked everywhere and can't figure out how to get the column headers as a list.

I would appreciate it if anyone could help with either of these issues.

First of all, get rid of that for i in range(len(filepaths))! The pythonic way is for i, filepath in enumerate(filepaths). enumerate yields (index, item) tuples, so you can say ExcelFile(filepath) instead of ExcelFile(filepaths[i]).
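As a rough sketch of that loop shape, the example below iterates with enumerate and collects the frames in a list so that a single concat call replaces the `if i == 0` branch. The column names and the tiny frames are made up for illustration and stand in for the parsed Excel sheets.

```python
import pandas as pd

# Hypothetical stand-ins for the file paths; each one yields a small frame.
names = ["a", "b", "c"]

frames = []
for i, name in enumerate(names):
    # i is the position, name is the item itself; no names[i] lookup needed.
    frames.append(pd.DataFrame({name: [i, i + 1]}))

# Concatenate once at the end instead of growing a frame inside the loop.
combined = pd.concat(frames, axis=1)
print(combined)
```

Collecting first and concatenating once also avoids special-casing the first file.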

I think your two problems are related. If I'm reading your code correctly, when you include rows 2 and 3 the dates can't be parsed because the timestamp column isn't homogeneous: it's not all timestamps.
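To see why a mixed column defeats date parsing, here is a small sketch. The sample values are assumptions modeled on a label row, a unit row and one real timestamp; passing errors="coerce" turns anything that does not match the format into NaT instead of raising.

```python
import pandas as pd

# Assumed contents of the timestamp column when the label and unit rows are kept.
mixed = pd.Series(["Timestamp", "ts", "2012-08-29 07:10:00"])

# Entries that do not match the explicit format become NaT rather than raising.
parsed = pd.to_datetime(mixed, format="%Y-%m-%d %H:%M:%S", errors="coerce")

n_bad = int(parsed.isna().sum())
print(parsed)
print(n_bad, "of", len(mixed), "entries are not timestamps")
```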

You could use a hierarchical index to get the columns in (column, label, unit) format. It's probably easiest to first read in just the header information, then read the data separately and set the columns after the fact (I don't have Excel handy right now, but I think all the read_csv options I use are also available when reading through xlrd):

In [7]: df_header = pd.read_csv('test.csv', nrows=2, index_col='three')

In [8]: df_header
Out[8]: 
               one      two    four
three                              
Timestamp  Decimal  Decimal  record
ts             ref      ref      rn

In [9]: df_data = pd.read_csv('test.csv', names=df_header.columns,
   ...:                       skiprows=4, parse_dates=True, index_col=2)

In [10]: df_data
Out[10]: 
                      one   two  four
2012-08-29 07:10:00  32.1  32.0   232
2012-08-29 09:10:00   1.1   1.2   233

In [11]: cols = pd.MultiIndex.from_tuples([tuple([x] + df_header[x].tolist())
   ....:                                   for x in df_header])

In [12]: cols
Out[12]: 
MultiIndex
[one   Decimal  ref, two   Decimal  ref, four  record   rn ]

In [14]: df_data.columns = cols

In [15]: df_data
Out[15]: 
                         one      two    four
                     Decimal  Decimal  record
                         ref      ref      rn
2012-08-29 07:10:00     32.1     32.0     232
2012-08-29 09:10:00      1.1      1.2     233

This should get you to the point in your code where you start dropping columns and concatenating. Also take a look at the developer docs. It looks like the syntax for reading Excel files is getting cleaned up (much nicer!). You might be able to use the parse_cols argument with a list of ints to avoid dropping columns later.
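In later pandas releases the column-selection argument is called usecols rather than parse_cols, and both read_csv and read_excel accept a list of integer positions for it. A minimal sketch, using read_csv on made-up in-memory text so that it runs without an Excel file:

```python
import io

import pandas as pd

# Hypothetical file contents with one header row.
RAW = """one,two,three,four
32.1,32.0,2012-08-29 07:10:00,232
1.1,1.2,2012-08-29 09:10:00,233
"""

# Keep only columns 0 and 2 (by position) at read time, so nothing
# has to be dropped afterwards.
df = pd.read_csv(io.StringIO(RAW), usecols=[0, 2], parse_dates=["three"])
df = df.set_index("three")
print(df)
```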

Oh, and you can get the column labels as a list with df_data.columns.tolist() (with the MultiIndex above, each entry is a (column, label, unit) tuple).
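Putting the pieces together, here is a self-contained sketch of the two-pass approach that ends with the labels and units as plain lists of strings. The CSV text, the level names ("name", "label", "unit") and the variable names are assumptions made up for illustration; an Excel file would go through the Excel reader instead of read_csv.

```python
import io

import pandas as pd

# Hypothetical file: one name row, one label row, one unit row, then data.
RAW = """one,two,three,four
Decimal,Decimal,Timestamp,record
ref,ref,ts,rn
32.1,32.0,2012-08-29 07:10:00,232
1.1,1.2,2012-08-29 09:10:00,233
"""

# Pass 1: only the three header rows, read as plain strings.
header = pd.read_csv(io.StringIO(RAW), header=None, nrows=3, dtype=str)

# Pass 2: only the data rows; column 2 holds the timestamps.
data = pd.read_csv(io.StringIO(RAW), header=None, skiprows=3,
                   index_col=2, parse_dates=True)
data.index.name = None

# Build one (name, label, unit) tuple per value column, skipping the
# timestamp column that became the index.
value_cols = [c for c in header.columns if c != 2]
data.columns = pd.MultiIndex.from_tuples(
    [tuple(header[c]) for c in value_cols],
    names=["name", "label", "unit"],
)

# Each level of the column index can be pulled out as a list of strings.
labels = data.columns.get_level_values("label").tolist()
units = data.columns.get_level_values("unit").tolist()
print(labels)
print(units)
```

Pulling one level at a time with get_level_values gives flat lists of strings, whereas columns.tolist() on a MultiIndex gives a list of tuples.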

