
Calculate average of every x rows in a table and create new table

I have a long table of data (~200 rows by 50 columns) and need code that calculates the mean of every two rows, for each column, with the final output being a new table of the mean values. This is obviously crazy to do in Excel! I use Python 3 and am aware of some similar questions, but none of them help, as I need some elegant code that works with multiple columns and produces an organised data table. My original table was imported with pandas and is a DataFrame, but I could not find an easy way to do this in pandas. Help is much appreciated.

An example of the table (short version) is:

a   b   c   d
2   50  25  26
4   11  38  44
6   33  16  25
8   37  27  25
10  28  48  32
12  47  35  45
14  8   16  7
16  12  16  30
18  22  39  29
20  9   15  47

Expected mean table:

a    b     c     d
3   30.5  31.5  35
7   35    21.5  25
11  37.5  41.5  38.5
15  10    16    18.5
19  15.5  27    38

You can create an artificial group using df.index//2 (or, as @DSM pointed out, using np.arange(len(df))//2, so that it works for any index) and then use groupby:

df.groupby(np.arange(len(df))//2).mean()
Out[13]: 
      a     b     c     d
0   3.0  30.5  31.5  35.0
1   7.0  35.0  21.5  25.0
2  11.0  37.5  41.5  38.5
3  15.0  10.0  16.0  18.5
4  19.0  15.5  27.0  38.0
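
Since the title asks about every x rows, the same idea can be wrapped in a small function. This is only a sketch: the helper name mean_every_n_rows is made up for illustration, and the data is the short sample table from the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "a": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
    "b": [50, 11, 33, 37, 28, 47, 8, 12, 22, 9],
    "c": [25, 38, 16, 27, 48, 35, 16, 16, 39, 15],
    "d": [26, 44, 25, 25, 32, 45, 7, 30, 29, 47],
})

def mean_every_n_rows(frame, n):
    """Average each consecutive block of n rows (hypothetical helper)."""
    # Positional labels 0, 0, ..., 1, 1, ... do not depend on the frame's own index.
    groups = np.arange(len(frame)) // n
    return frame.groupby(groups).mean()

pairs = mean_every_n_rows(df, 2)
print(pairs)
```

If the row count is not a multiple of n, the last group simply averages the leftover rows.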

You can approach this problem using df.rolling() to create a rolling average and then grab every second element using iloc:

df = df.rolling(2).mean()
df = df.iloc[1::2, :]  # rows 1, 3, 5, ... hold the means of pairs (0, 1), (2, 3), (4, 5), ...

Note that the first row of the rolling mean is missing (NaN), because the window at the top contains only one value, so the slice has to start at row 1. Also make sure to check that your data is sorted how you need it.
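
As a sanity check on which rows of the rolling mean correspond to non-overlapping pairs, here is a small sketch using the first six rows of columns a and b from the question. The reset_index call is an optional extra, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 4, 6, 8, 10, 12], "b": [50, 11, 33, 37, 28, 47]})

# Row i of the rolling mean averages rows i-1 and i; row 0 has no predecessor, so it is NaN.
rolled = df.rolling(2).mean()

# Rows 1, 3, 5 are the means of the pairs (0, 1), (2, 3), (4, 5).
pairs = rolled.iloc[1::2].reset_index(drop=True)
print(pairs)
```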

The NumPythonic way would be to extract the elements as a NumPy array with df.values, reshape to a 3D array with 2 elements along axis=1 and 4 along axis=2, perform the average reduction along axis=1, and finally convert back to a dataframe, like so:

pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
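
To make the intermediate shapes visible, the one-liner can be unrolled step by step. This is a sketch on a made-up 6 x 4 frame, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4))

values = df.to_numpy()                        # shape (6, 4)
blocks = values.reshape(-1, 2, df.shape[1])   # shape (3, 2, 4): 3 pairs of rows
means = blocks.mean(axis=1)                   # shape (3, 4): one row per pair
result = pd.DataFrame(means)
print(blocks.shape, means.shape)
```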

As it turns out, you can use NumPy's very efficient tool np.einsum to do this average-reduction as a combination of sum-reduction and scaling-down, like so:

pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
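
The subscript string 'ijk->ik' drops the middle axis j, which means summing over it. Dividing by 2 then gives the mean. A quick check on made-up random data that this matches mean(axis=1):

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.integers(0, 9, size=(10, 4))
blocks = values.reshape(-1, 2, values.shape[1])   # (5, 2, 4)

via_einsum = np.einsum("ijk->ik", blocks) / 2.0   # sum over axis 1, then scale down
via_mean = blocks.mean(axis=1)

print(np.allclose(via_einsum, via_mean))
```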

Please note that the proposed approaches assume that the number of rows is divisible by 2.

Also, as noted by @DSM, to preserve the column names you need to add columns=df.columns when converting back to a DataFrame, i.e.:

pd.DataFrame(...,columns=df.columns)
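
Putting the two caveats together, a possible wrapper could keep the column names and cope with a row count that is not a multiple of the block size. The helper name block_mean is hypothetical, and dropping the incomplete trailing block is an assumption made here, not something the answer prescribes.

```python
import numpy as np
import pandas as pd

def block_mean(frame, n=2):
    """Mean of each full block of n rows (hypothetical helper, not from the answer)."""
    # Assumption: rows that do not fill a complete block at the end are dropped.
    usable = len(frame) - len(frame) % n
    values = frame.to_numpy()[:usable]
    means = values.reshape(-1, n, frame.shape[1]).mean(axis=1)
    return pd.DataFrame(means, columns=frame.columns)

# Five rows: the last row has no partner and is ignored.
df = pd.DataFrame({"a": [2, 4, 6, 8, 10], "b": [50, 11, 33, 37, 28]})
out = block_mean(df)
print(out)
```

If the leftover rows should be averaged rather than dropped, the groupby approach handles that without extra code.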

Sample run:

>>> df
    0   1   2   3
0   2  50  25  26
1   4  11  38  44
2   6  33  16  25
3   8  37  27  25
4  10  28  48  32
5  12  47  35  45
6  14   8  16   7
7  16  12  16  30
8  18  22  39  29
9  20   9  15  47
>>> pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
    0     1     2     3
0   3  30.5  31.5  35.0
1   7  35.0  21.5  25.0
2  11  37.5  41.5  38.5
3  15  10.0  16.0  18.5
4  19  15.5  27.0  38.0
>>> pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
    0     1     2     3
0   3  30.5  31.5  35.0
1   7  35.0  21.5  25.0
2  11  37.5  41.5  38.5
3  15  10.0  16.0  18.5
4  19  15.5  27.0  38.0

Runtime tests:

In this section, let's test the performance of all three approaches listed thus far, including @ayhan's solution with groupby.

In [24]: A = np.random.randint(0,9,(200,50))

In [25]: df = pd.DataFrame(A)

In [26]: %timeit df.groupby(df.index//2).mean() # @ayhan's solution
1000 loops, best of 3: 1.61 ms per loop

In [27]: %timeit pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
1000 loops, best of 3: 317 µs per loop

In [28]: %timeit pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
1000 loops, best of 3: 266 µs per loop
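
Outside IPython, a comparable measurement could be sketched with the standard timeit module. The dictionary of candidates and the repeat counts are choices made for this example. Absolute numbers will differ by machine and library version, so no timings are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

A = np.random.randint(0, 9, (200, 50))
df = pd.DataFrame(A)

candidates = {
    "groupby": lambda: df.groupby(np.arange(len(df)) // 2).mean(),
    "reshape": lambda: pd.DataFrame(df.values.reshape(-1, 2, df.shape[1]).mean(1)),
    "einsum": lambda: pd.DataFrame(
        np.einsum("ijk->ik", df.values.reshape(-1, 2, df.shape[1])) / 2.0
    ),
}

# The approaches should agree before their timings are compared.
reference = candidates["groupby"]().to_numpy()
for name, func in candidates.items():
    assert np.allclose(func().to_numpy(), reference)
    best = min(timeit.repeat(func, number=100, repeat=3)) / 100
    print(f"{name}: {best * 1e6:.0f} us per call")
```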
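
Another option is to set the group labels as the index and average per index level:

df.set_index(np.arange(len(df)) // 2).groupby(level=0).mean()  # .mean(level=0) was removed in pandas 2.0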

In your case, as you want to average the rows, assuming your dataframe is named new:

new = new.groupby(np.arange(len(new)) // 2).mean() 

If one wants to average the columns instead:

new = new.groupby(np.arange(len(new.columns)) // 2, axis=1).mean()
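
Recent pandas releases deprecate axis=1 in groupby, so a version-tolerant sketch of averaging adjacent pairs of columns is to transpose, group the rows, and transpose back. The two-row frame below is just the first two rows of the question's sample.

```python
import numpy as np
import pandas as pd

new = pd.DataFrame({"a": [2, 4], "b": [50, 11], "c": [25, 38], "d": [26, 44]})

# Labels 0, 0, 1, 1 pair up columns (a, b) and (c, d).
col_groups = np.arange(len(new.columns)) // 2

# Transpose so columns become rows, average per group, then transpose back.
by_cols = new.T.groupby(col_groups).mean().T
print(by_cols)
```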

I got ValueError: Grouper and axis must be same length when I tried using numpy to create the artificial group. As an alternative, you can use itertools to build the group labels as a Series:

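import itertools
import pandas as pd
SAMPLE_SIZE = 2
# Each index value repeated SAMPLE_SIZE times: 0, 0, 1, 1, ... (assumes a default RangeIndex)
label_series = pd.Series(itertools.chain.from_iterable(itertools.repeat(x, SAMPLE_SIZE) for x in df.index))
sampled_df = df.groupby(label_series).mean()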
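
A variation on the same idea, sketched under the assumption that the labels should line up with the rows regardless of the DataFrame's index: generate exactly len(df) labels and give the Series the same index as df. The sample data and the n_blocks variable are illustrative.

```python
import itertools

import pandas as pd

SAMPLE_SIZE = 2
df = pd.DataFrame({"a": [2, 4, 6, 8, 10, 12], "b": [50, 11, 33, 37, 28, 47]})

# One label per block, each repeated SAMPLE_SIZE times: 0, 0, 1, 1, 2, 2, ...
n_blocks = -(-len(df) // SAMPLE_SIZE)  # ceiling division
labels = itertools.chain.from_iterable(
    itertools.repeat(block, SAMPLE_SIZE) for block in range(n_blocks)
)

# Take exactly len(df) labels and reuse df's index so the Series aligns row for row.
label_series = pd.Series(list(itertools.islice(labels, len(df))), index=df.index)
sampled_df = df.groupby(label_series).mean()
print(sampled_df)
```

Because groupby aligns a Series key on the index, giving the labels the same index as df avoids surprises when the index is not 0, 1, 2, ....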
