How to rearrange a Python pandas dataframe?
I have the following dataframe, read in from a .csv file with the "Date" column as the index. Each row is one day, and the columns hold the values for each hour of that day.
> Date h1 h2 h3 h4 ... h24
> 14.03.2013 60 50 52 49 ... 73
I would like to arrange it like this, so that there is one index column with the date/time and one column with the values in sequence:
>Date/Time Value
>14.03.2013 00:00:00 60
>14.03.2013 01:00:00 50
>14.03.2013 02:00:00 52
>14.03.2013 03:00:00 49
>.
>.
>.
>14.03.2013 23:00:00 73
I tried using two loops to go through the dataframe. Is there an easier way to do this in pandas?
I'm not the best at date manipulations, but maybe something like this:
import pandas as pd
from datetime import timedelta
df = pd.read_csv("hourmelt.csv", sep=r"\s+")
df = pd.melt(df, id_vars=["Date"])
df = df.rename(columns={'variable': 'hour'})
df['hour'] = df['hour'].apply(lambda x: int(x.lstrip('h'))-1)
combined = df.apply(lambda x:
                    pd.to_datetime(x['Date'], dayfirst=True) +
                    timedelta(hours=int(x['hour'])), axis=1)
df['Date'] = combined
del df['hour']
df = df.sort_values("Date")  # DataFrame.sort was removed from pandas; sort_values replaces it
Some explanation follows.
Starting from
>>> import pandas as pd
>>> from datetime import datetime, timedelta
>>>
>>> df = pd.read_csv("hourmelt.csv", sep=r"\s+")
>>> df
Date h1 h2 h3 h4 h24
0 14.03.2013 60 50 52 49 73
1 14.04.2013 5 6 7 8 9
We can use pd.melt to turn the hour columns into a single column, with the matching values alongside:
>>> df = pd.melt(df, id_vars=["Date"])
>>> df = df.rename(columns={'variable': 'hour'})
>>> df
Date hour value
0 14.03.2013 h1 60
1 14.04.2013 h1 5
2 14.03.2013 h2 50
3 14.04.2013 h2 6
4 14.03.2013 h3 52
5 14.04.2013 h3 7
6 14.03.2013 h4 49
7 14.04.2013 h4 8
8 14.03.2013 h24 73
9 14.04.2013 h24 9
Get rid of those h prefixes and shift to a zero-based hour (h1 becomes 0, h24 becomes 23):
>>> df['hour'] = df['hour'].apply(lambda x: int(x.lstrip('h'))-1)
>>> df
Date hour value
0 14.03.2013 0 60
1 14.04.2013 0 5
2 14.03.2013 1 50
3 14.04.2013 1 6
4 14.03.2013 2 52
5 14.04.2013 2 7
6 14.03.2013 3 49
7 14.04.2013 3 8
8 14.03.2013 23 73
9 14.04.2013 23 9
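The same conversion can probably be written without a Python-level lambda by using the string accessor. This is a minimal sketch, assuming every label has the form h followed by a number; the small frame is made up for illustration and is not the asker's data.

```python
import pandas as pd

# Hypothetical melted frame, shaped like the output of pd.melt above.
df = pd.DataFrame({
    "Date": ["14.03.2013", "14.04.2013", "14.03.2013", "14.04.2013"],
    "hour": ["h1", "h1", "h24", "h24"],
    "value": [60, 5, 73, 9],
})

# Drop the leading "h", convert to int, and shift to a zero-based hour.
df["hour"] = df["hour"].str.lstrip("h").astype(int) - 1
print(df)
```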
Combine the two columns as a date:
>>> combined = df.apply(lambda x: pd.to_datetime(x['Date'], dayfirst=True) + timedelta(hours=int(x['hour'])), axis=1)
>>> combined
0 2013-03-14 00:00:00
1 2013-04-14 00:00:00
2 2013-03-14 01:00:00
3 2013-04-14 01:00:00
4 2013-03-14 02:00:00
5 2013-04-14 02:00:00
6 2013-03-14 03:00:00
7 2013-04-14 03:00:00
8 2013-03-14 23:00:00
9 2013-04-14 23:00:00
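Applying a function row by row can be slow on large frames. Here is a vectorized sketch of the same step, assuming all Date strings are in DD.MM.YYYY form and the hour column already holds zero-based integers. The sample frame is invented for the example.

```python
import pandas as pd

# Hypothetical frame in the shape reached after stripping the "h" prefixes.
df = pd.DataFrame({
    "Date": ["14.03.2013", "14.04.2013", "14.03.2013", "14.04.2013"],
    "hour": [0, 0, 23, 23],
    "value": [60, 5, 73, 9],
})

# Parse all dates at once, then add the hour offsets as timedeltas.
dates = pd.to_datetime(df["Date"], format="%d.%m.%Y")
combined = dates + pd.to_timedelta(df["hour"], unit="h")
print(combined)
```

Passing an explicit format avoids any ambiguity between day-first and month-first parsing.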
Reassemble and clean up:
>>> df['Date'] = combined
>>> del df['hour']
>>> df = df.sort_values("Date")
>>> df
Date value
0 2013-03-14 00:00:00 60
2 2013-03-14 01:00:00 50
4 2013-03-14 02:00:00 52
6 2013-03-14 03:00:00 49
8 2013-03-14 23:00:00 73
1 2013-04-14 00:00:00 5
3 2013-04-14 01:00:00 6
5 2013-04-14 02:00:00 7
7 2013-04-14 03:00:00 8
9 2013-04-14 23:00:00 9
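The result above still keeps the timestamps in an ordinary column, while the question asks for them as the index. One possible last step is sketched below; the frame is a made-up stand-in for the result, and the names "Date/Time" and "Value" are taken from the layout requested in the question.

```python
import pandas as pd

# Invented frame in the shape produced by the steps above.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2013-03-14 01:00", "2013-04-14 00:00", "2013-03-14 00:00"]),
    "value": [50, 5, 60],
})

# Sort chronologically, then promote the timestamps to the index.
result = df.sort_values("Date").set_index("Date").rename_axis("Date/Time")
result = result.rename(columns={"value": "Value"})
print(result)
```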
You could always grab the hourly data array and flatten it, then generate a new DatetimeIndex with hourly frequency.
df = df.asfreq('D')
hourly_data = df.values[:, :]
new_ind = pd.date_range(start=df.index[0], freq="H", periods=len(df) * 24)
# create Series.
s = pd.Series(hourly_data.flatten(), index=new_ind)
I'm assuming that read_csv is parsing the 'Date' column and making it the index. We change to a frequency of 'D' so that new_ind lines up correctly if you have missing days. The missing days will be filled with np.nan, which you can drop with s.dropna().
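To make the flattening approach concrete, here is a self-contained sketch on invented data: two days with a one-day gap between them and 24 hourly columns each. The column names h1 to h24 follow the question; the values are arbitrary, and the hourly frequency is passed as a Timedelta rather than a string alias, which is a choice made for this example only.

```python
import numpy as np
import pandas as pd

# Made-up frame: two days with a gap between them, 24 hourly columns each.
days = pd.to_datetime(["14.03.2013", "16.03.2013"], format="%d.%m.%Y")
cols = [f"h{i}" for i in range(1, 25)]
df = pd.DataFrame(np.arange(48, dtype=float).reshape(2, 24), index=days, columns=cols)

# Insert the missing day as a row of NaN so the rows are exactly one day apart.
df = df.asfreq("D")

# One timestamp per hour, starting at midnight of the first day.
new_ind = pd.date_range(start=df.index[0], periods=df.size, freq=pd.Timedelta(hours=1))

# Row-major flattening keeps h1..h24 of each day together, in order.
s = pd.Series(df.to_numpy().ravel(), index=new_ind).dropna()
print(s.head())
```

This relies on the columns already being in hour order and on every day having all 24 columns; if either assumption fails, the values would be attached to the wrong timestamps.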