
Pandas: Aggregating column slices as arrays

I have a Pandas dataframe that looks like this:

                      Scaled
               Date 
2020-07-01 02:40:00 0.604511
2020-07-01 02:45:00 0.640577
2020-07-01 02:50:00 0.587683
2020-07-01 02:55:00 0.491515
....

I am trying to add a new column called X, in which each row holds the two previous values as an array. It should look like this:

                      Scaled   X
               Date 
2020-07-01 02:40:00 0.604511 nan
2020-07-01 02:45:00 0.640577 nan
2020-07-01 02:50:00 0.587683 [0.604511 0.640577]
2020-07-01 02:55:00 0.491515 [0.640577 0.587683]
...

I am trying a for-loop to do so (but it is not going as intended), and I don't think this is the most elegant or efficient way. Any suggestions on how to do this in pandas?

window_size = 2
for i in range(window_size, df.shape[0]):
    df['X'][i] = df['Scaled'][i - window_size:window_size] 

To do this in pandas, you can combine a list comprehension with concat and shift:

window_size = 2
df['X'] = (pd.concat([df.Scaled.shift(-i) for i in range(window_size)], axis=1)
             .shift(window_size).values.tolist())

Out[213]:
     Scaled                               X
0  0.604511                      [nan, nan]
1  0.640577                      [nan, nan]
2  0.587683  [0.604511, 0.6405770000000001]
3  0.491515  [0.6405770000000001, 0.587683]
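
The output above fills the first rows with [nan, nan], while the question asks for a single nan there. A minimal sketch of the same shift-and-concat idea that does this for any window size could look as follows. The helper name previous_windows is hypothetical (not from the answer), and the sample frame is rebuilt from the question's four rows:

```python
import numpy as np
import pandas as pd


def previous_windows(series, window_size):
    """For each row, collect the `window_size` preceding values in a list.

    Rows without a full window of history get a single NaN instead.
    """
    # Column k holds the value k rows ahead; shifting the whole frame down
    # by window_size turns each row into its preceding values, oldest first.
    shifted = pd.concat(
        [series.shift(-k) for k in range(window_size)], axis=1
    ).shift(window_size)
    rows = shifted.to_numpy().tolist()
    # The first window_size rows have no complete history
    cells = [row if i >= window_size else np.nan for i, row in enumerate(rows)]
    return pd.Series(cells, index=series.index, dtype=object)


df = pd.DataFrame(
    {"Scaled": [0.604511, 0.640577, 0.587683, 0.491515]},
    index=pd.to_datetime(
        [
            "2020-07-01 02:40:00",
            "2020-07-01 02:45:00",
            "2020-07-01 02:50:00",
            "2020-07-01 02:55:00",
        ]
    ),
)
df.index.name = "Date"
df["X"] = previous_windows(df["Scaled"], window_size=2)
print(df)
```

Returning a Series with the original index keeps the assignment aligned even when the frame is indexed by Date rather than by position.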

You had the right idea with using a for loop.

First you have to initialize the new column; you can use .apply() on your dataframe to do this.

Then you can use .iterrows() to iterate through the rows of the dataframe, building the desired array for each index as you go.

import pandas as pd

df = pd.DataFrame(data={'Date': ['2020-07-01 02:40:00', '2020-07-01 02:45:00', '2020-07-01 02:50:00', '2020-07-01 02:55:00'], 'Scaled': [0.604511, 0.640577, 0.587683, 0.491515]})

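# Initialize the new column; object dtype lets each cell hold a list
df['New_col'] = df['Scaled'].apply(lambda x : float("NAN")).astype(object)

for i, val in df.iterrows():
  if i == 0 or i == 1:
    # The first two rows have no two previous values
    scaled_a = None
    scaled_b = None
  else:
    scaled_a = df['Scaled'][i-2]
    scaled_b = df['Scaled'][i-1]
  # .at sets a single cell and avoids chained assignment
  df.at[i, 'New_col'] = [scaled_a, scaled_b]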

Simply take the values of the Scaled column at the previous two indices and save them as an array in the new column. Hope this helped!

    Date                Scaled      New_col
0   2020-07-01 02:40:00 0.604511    [None, None]
1   2020-07-01 02:45:00 0.640577    [None, None]
2   2020-07-01 02:50:00 0.587683    [0.604511, 0.640577]
3   2020-07-01 02:55:00 0.491515    [0.640577, 0.587683]

The result should look something like the table above.

Updated: identical output. This is a pandas implementation that uses numpy to generate the list that becomes the new column of df, which is very efficient.

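import datetime as dt
import random
from decimal import Decimal

import numpy as np
import pandas as pd

d = list(pd.date_range(dt.datetime(2020,7,1), dt.datetime(2020,7,2), freq="15min"))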
df = pd.DataFrame({"Date":d, 
      "Scaled":[round(Decimal(random.uniform(0, 1)),6) for x in d]})


# generate two new arrays that are shifted version of *scaled*
a1 = np.roll(df["Scaled"],1)
a1[0:2] = None
a2 = np.roll(df["Scaled"],2)
a2[0:2] = None
# combine them into a list and put back into df
df['X'] = np.vstack((a2, a1)).T.tolist()

print(df[:10].to_string(index=False))

Output:

               Date    Scaled                     X
2020-07-01 00:00:00  0.396534          [None, None]
2020-07-01 00:15:00  0.890777          [None, None]
2020-07-01 00:30:00  0.241534  [0.396534, 0.890777]
2020-07-01 00:45:00  0.800615  [0.890777, 0.241534]
2020-07-01 01:00:00  0.161382  [0.241534, 0.800615]
2020-07-01 01:15:00  0.727410  [0.800615, 0.161382]
2020-07-01 01:30:00  0.146833  [0.161382, 0.727410]
2020-07-01 01:45:00  0.925441  [0.727410, 0.146833]
2020-07-01 02:00:00  0.770211  [0.146833, 0.925441]
2020-07-01 02:15:00  0.310082  [0.925441, 0.770211]
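
One detail worth spelling out: np.roll is circular, so values from the end of the column wrap around to the front, which is why the first two entries are overwritten in the code above. A small sketch on the question's four values illustrates this. It assumes a plain float array and np.nan as the filler, rather than the Decimal values and None used above:

```python
import numpy as np

scaled = np.array([0.604511, 0.640577, 0.587683, 0.491515])

# np.roll is circular: the last value wraps around to position 0
rolled = np.roll(scaled, 1)

# Build the "one back" and "two back" arrays, then blank the rows
# that do not have a full two-value history
prev1 = np.roll(scaled, 1)
prev1[:2] = np.nan
prev2 = np.roll(scaled, 2)
prev2[:2] = np.nan

# Pair them up row by row, oldest value first
pairs = np.vstack((prev2, prev1)).T.tolist()
print(rolled)
print(pairs)
```

Without the two overwrites, the first rows would silently contain values taken from the end of the series.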

Here's a version without the for loop. First, create the data frame:

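from io import StringIO

import numpy as np
import pandas as pd

data = '''Date  Scaled
2020-07-01 02:40:00  0.604511
2020-07-01 02:45:00  0.640577
2020-07-01 02:50:00  0.587683
2020-07-01 02:55:00  0.491515
'''
# Columns are separated by two or more spaces; a raw string keeps the regex escapes intact
df = pd.read_csv(StringIO(data), sep=r'\s\s+', engine='python')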

Next, use shift() to get the previous values; the lambda function then creates the 2-element list or yields a single NaN:

f = lambda a, b: np.nan if np.isnan(a) or np.isnan(b) else [a, b]

window_size = 2

t = (pd.concat([df['Scaled'].shift(window_size).rename('a'), 
                df['Scaled'].shift(window_size - 1).rename('b')], axis=1
          )
       .apply(lambda x: f(x['a'].round(6), x['b'].round(6)), axis=1))

print(t)

0                     NaN
1                     NaN
2    [0.604511, 0.640577]
3    [0.640577, 0.587683]
dtype: object
