Obtain list of strings from column index of pandas dataframe
First off, here is what my .xlsx timeseries data looks like:
and here is how I'm reading it:
import time

import numpy as np
import pandas as pd


def loaddata(filepaths):
    t1 = time.time()
    for i in range(len(filepaths)):
        xl = pd.ExcelFile(filepaths[i])
        # Row 1 holds the column labels; rows 0, 2, 3 and 4 are skipped.
        df = xl.parse(xl.sheet_names[0], header=0, index_col=2, skiprows=[0,2,3,4], parse_dates=True)
        df = df.dropna(axis=1, how='all')
        df = df.drop(['Decimal Year Day', 'Decimal Year Day.1', 'RECORD'], axis=1)
        # Round the timestamps to the nearest minute.
        df.index = pd.DatetimeIndex(((df.index.asi8/(1e9*60)).round()*1e9*60).astype(np.int64)).values
        if i == 0:
            dfs = df
        else:
            dfs = pd.concat([dfs, df], axis=1)
    t2 = time.time()
    print("Files loaded into dataframe in %s seconds" % (t2 - t1))
    return dfs

files = ["London Lysimeters corrected 5min.xlsx"]
data = loaddata(files)
What I need to do is read the column labels and units (rows 2 and 3) as well as the values into a pandas DataFrame, and access the labels row and the units row as lists of strings. I can't figure out how to load both rows 2 and 3 and still have the time column read correctly into a pandas DatetimeIndex, though it works fine if I load only the labels. I've also looked everywhere and can't figure out how to get the column headers as a list.
I would appreciate it if anyone could help with either of these issues.
First of all, get rid of that for i in range(len(filepaths))! The Pythonic way is for i, filepath in enumerate(filepaths). enumerate yields (index, item) tuples, so you can write ExcelFile(filepath) instead of ExcelFile(filepaths[i]).
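As a minimal sketch of that loop shape (the helper name label_files and the second file name are made up for illustration, and no Excel file is opened here):

```python
def label_files(filepaths):
    """Pair each path with its position; enumerate yields (index, item) tuples."""
    labelled = []
    for i, filepath in enumerate(filepaths):
        # i is only needed to treat the first file specially, as the original loop does.
        role = "first" if i == 0 else "appended"
        labelled.append((i, filepath, role))
    return labelled


# The second file name is hypothetical, added only to show more than one iteration.
files = ["London Lysimeters corrected 5min.xlsx", "second_file.xlsx"]
result = label_files(files)
```

In loaddata, the same pattern would let the body use filepath directly and keep i only for the first-file check.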
I think your two problems are related. If I'm reading your code correctly, when you include rows 2 and 3 the dates can't be parsed because the timestamp column isn't homogeneous: it's not all timestamps.
You could use a hierarchical index to get the data in (column, label, unit) format. It's probably easiest to first read in just the header information, then read the data separately and set the columns after the fact (I don't have Excel handy right now, but I think all the read_csv options I use are also available when reading Excel files through xlrd):
In [7]: df_header = pd.read_csv('test.csv', nrows=2, index_col='three')

In [8]: df_header
Out[8]:
               one      two    four
three
Timestamp  Decimal  Decimal  record
ts             ref      ref      rn

In [9]: df_data = pd.read_csv('test.csv', names=df_header.columns,
   ...:                       skiprows=4, parse_dates=True, index_col=2)

In [10]: df_data
Out[10]:
                      one   two  four
2012-08-29 07:10:00  32.1  32.0   232
2012-08-29 09:10:00   1.1   1.2   233

In [11]: cols = pd.MultiIndex.from_tuples([tuple([x] + df_header[x].tolist())
   ....:                                   for x in df_header])

In [12]: cols
Out[12]:
MultiIndex
[one Decimal ref, two Decimal ref, four record rn ]

In [14]: df_data.columns = cols

In [15]: df_data
Out[15]:
                         one      two    four
                     Decimal  Decimal  record
                         ref      ref      rn
2012-08-29 07:10:00     32.1     32.0     232
2012-08-29 09:10:00      1.1      1.2     233
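For readers without the test.csv file, here is a self-contained sketch of the same two-pass idea, assuming a recent pandas. The inline CSV text, the function name read_with_units, and the level names "column", "label" and "unit" are all made up for illustration. The header rows are read as plain strings, the data rows are read separately with the timestamp column as the index, and the two are joined through a MultiIndex.

```python
import io

import pandas as pd

# Hypothetical stand-in for the spreadsheet: names, labels, units, then data.
CSV_TEXT = """one,two,three,four
Decimal,Decimal,Timestamp,record
ref,ref,ts,rn
32.1,32.0,2012-08-29 07:10:00,232
1.1,1.2,2012-08-29 09:10:00,233
"""


def read_with_units(text, index_pos=2, n_header_rows=3):
    # Pass 1: read only the header rows, keeping every cell as a string.
    header = pd.read_csv(io.StringIO(text), header=None,
                         nrows=n_header_rows, dtype=str)
    # Pass 2: read only the data rows, so the timestamp column is homogeneous
    # and can be parsed into a DatetimeIndex.
    data = pd.read_csv(io.StringIO(text), header=None, skiprows=n_header_rows,
                       index_col=index_pos, parse_dates=True)
    data.index.name = "timestamp"  # assumed name, purely illustrative
    # Drop the index column from the header, then use each header row as one level.
    header = header.drop(columns=index_pos)
    data.columns = pd.MultiIndex.from_arrays(
        header.to_numpy().tolist(), names=["column", "label", "unit"]
    )
    return data


df = read_with_units(CSV_TEXT)
```

Reading the header and the data in separate passes is what keeps the timestamp column free of label and unit strings, which is the reason date parsing failed when rows 2 and 3 were included.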
This should get you to the point in your code where you start dropping columns and concatenating. Also take a look at the developer docs: it looks like the syntax for reading Excel files is getting cleaned up (much nicer!). You might be able to use the parse_cols argument with a list of ints to avoid dropping columns later.
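In later pandas releases the Excel reader's parse_cols argument was renamed usecols, matching read_csv. A rough sketch of selecting columns at read time, using read_csv on made-up inline text as a stand-in because no Excel file is available here:

```python
import io

import pandas as pd

# Hypothetical data with one header row; the column "two" is the one to leave out.
CSV_TEXT = """one,two,three,four
32.1,32.0,2012-08-29 07:10:00,232
1.1,1.2,2012-08-29 09:10:00,233
"""

# usecols takes column positions; the index column is then picked by its label.
df_subset = pd.read_csv(io.StringIO(CSV_TEXT), usecols=[0, 2, 3],
                        index_col="three", parse_dates=True)
```

Selecting columns while reading avoids the separate drop call, at the cost of hard-coding positions that must match the sheet layout.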
Oh, and you can get the list of strings with df_data.columns.tolist().
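One caveat worth sketching: once the columns are a MultiIndex, tolist() returns a list of tuples rather than plain strings, and get_level_values gives one flat list per header row. The frame below is rebuilt by hand from the example values so the snippet runs on its own.

```python
import pandas as pd

# Rebuild the example frame's three-level columns by hand.
columns = pd.MultiIndex.from_tuples(
    [("one", "Decimal", "ref"), ("two", "Decimal", "ref"), ("four", "record", "rn")]
)
df_data = pd.DataFrame([[32.1, 32.0, 232], [1.1, 1.2, 233]], columns=columns)

as_tuples = df_data.columns.tolist()                   # one tuple per column
names = df_data.columns.get_level_values(0).tolist()   # top header row
labels = df_data.columns.get_level_values(1).tolist()  # second header row
units = df_data.columns.get_level_values(2).tolist()   # third header row
```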