Convert column values for a group of data frame rows into a list in the column

For this question, let's take the following example. I have a dataframe that looks as follows ( df.head() ):

   Unnamed: 0  PacketTime  FrameLen  FrameCapLen  ...  Speed  Delay  Loss  Interval
0           1    0.056078       116          116  ...     25      0     0         0
1           2    0.056106        66           66  ...     25      0     0         0
2           3    2.058089       116          116  ...     25      0     0         2
3           4    2.058115        66           66  ...     25      0     0         2
4           5    4.060316       116          116  ...     25      0     0         4

[5 rows x 23 columns]

As you can see, the groups are defined by the Interval column. I know that pandas has df.groupby(colname), but what I want is to group the rows of each interval so that the column values are listed together. An example of the desired output:

   Unnamed: 0  PacketTime  FrameLen  FrameCapLen  ...  Speed  Delay  Loss  Interval
0           1    0.000028    116,66       116,66  ...  25,25    0,0   0,0         0
1           2    0.000026    116,66       116,66  ...  25,25    0,0   0,0         2
...

[5 rows x 23 columns]

As you can see, the desired end result has the column values collected into a list for each interval group, and the packet time reduced to max(PacketTime)-min(PacketTime) for each interval group.

These are two separate tasks. For both, let's use the fact that a groupby operation does the following:

1. Split a single data frame into multiple data frames based on a single column.
2. Apply an operation to each data frame.
3. Stitch the resulting data frames together.
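
The three steps can be sketched by hand and compared with the built-in call. This is a minimal illustration, assuming a small made-up frame whose values are copied from the df.head() output in the question; the names pieces, applied, by_hand and built_in are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the question's data (values from df.head()).
df = pd.DataFrame({
    "PacketTime": [0.056078, 0.056106, 2.058089, 2.058115, 4.060316],
    "FrameLen": [116, 66, 116, 66, 116],
    "Interval": [0, 0, 2, 2, 4],
})

# 1. Split: iterating over a groupby yields (key, sub-frame) pairs.
pieces = {key: group for key, group in df.groupby("Interval")}

# 2. Apply: run a function on each sub-frame (here: the sum of FrameLen).
applied = {key: group["FrameLen"].sum() for key, group in pieces.items()}

# 3. Combine: stitch the per-group results back into one object.
by_hand = pd.Series(applied, name="FrameLen").rename_axis("Interval")

# The same three steps in a single groupby call.
built_in = df.groupby("Interval")["FrameLen"].sum()
```

Both by_hand and built_in hold one summed FrameLen value per Interval; groupby simply hides the loop.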

First job:

Have a single row per interval for all columns other than PacketTime, where each value is a list of the values in that interval (two, in this example).

We want to collect values into a list, so let's use Series.to_list() for that. For a reason unknown to me, calling df.apply(lambda s: s.to_list()) doesn't work: pandas automatically converts the lists back into normal columns. Calling it on rows, however, returns what we want: a Series of lists. So we transpose the frame, turning columns into rows, and apply to_list to the rows (the former columns).

Example

df.T.apply(lambda series: series.to_list(), axis='columns')

results in:

PacketTime     [0.056078, 0.056106, 2.058089, 2.058115, 4.060...
FrameLen                       [116.0, 66.0, 116.0, 66.0, 116.0]
FrameCapLen                    [116.0, 66.0, 116.0, 66.0, 116.0]
Unnamed: 3                             [nan, nan, nan, nan, nan]
Speed                             [25.0, 25.0, 25.0, 25.0, 25.0]
Delay                                  [0.0, 0.0, 0.0, 0.0, 0.0]
Loss                                   [0.0, 0.0, 0.0, 0.0, 0.0]
Interval                               [0.0, 0.0, 2.0, 2.0, 4.0]
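
To make the difference between the two calls concrete, here is a small sketch on a made-up two-column frame (the data and the names column_wise and row_wise are assumptions for illustration). It relies on the default result_type of DataFrame.apply.

```python
import pandas as pd

# Hypothetical two-column frame with three rows.
df = pd.DataFrame({
    "FrameLen": [116, 66, 116],
    "Speed": [25, 25, 25],
})

# Column-wise apply: every returned list has the same length, so pandas
# reassembles the lists into an ordinary DataFrame.
column_wise = df.apply(lambda s: s.to_list())

# Row-wise apply on the transposed frame: the list results are kept as-is,
# giving a Series with one list per original column.
row_wise = df.T.apply(lambda s: s.to_list(), axis="columns")
```

Here column_wise is a DataFrame again, while row_wise is a Series indexed by the original column names.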

This is exactly what we want for each interval. So let's define it as a function and apply it to each interval then, right?!


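import pandas as pd

# Load the data; 'example.xlsx' is the sample file used in this answer.
df = pd.read_excel('example.xlsx')


def to_list(group):
    # Transpose so each original column becomes a row, then turn each row into a list.
    return group.T.apply(lambda x: x.to_list(), axis='columns')


# One row per Interval; PacketTime is handled separately in the second job.
df_other = df.groupby('Interval')\
            .apply(to_list)\
            .drop(columns='PacketTime')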
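
As an alternative to the transpose trick (not the method used in this answer), passing the built-in list to groupby(...).agg also collects each group's values into lists. A minimal sketch, assuming a made-up frame with a subset of the question's columns; df_lists is a hypothetical name.

```python
import pandas as pd

# Hypothetical miniature of the question's data (values from df.head()).
df = pd.DataFrame({
    "PacketTime": [0.056078, 0.056106, 2.058089, 2.058115, 4.060316],
    "FrameLen": [116, 66, 116, 66, 116],
    "Speed": [25, 25, 25, 25, 25],
    "Interval": [0, 0, 2, 2, 4],
})

# Drop PacketTime first, then collect every remaining non-key column
# into one list per Interval group.
df_lists = (
    df.drop(columns="PacketTime")
      .groupby("Interval")
      .agg(list)
)
```

Unlike the apply-based version, the grouping column appears only as the index of the result here, not as an extra column of lists.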

Second job:

To calculate the duration, all we need is a function that takes the minimum and the maximum of the time and subtracts one from the other to get the time span:

     
def min_max(s):
    return s.max()-s.min()

Now we just apply it and join the two results together:

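# Time span of each Interval group.
s_Interval = df.groupby('Interval')['PacketTime']\
            .apply(min_max)
# Both results are indexed by Interval, so they line up column-wise.
final_df = pd.concat([df_other, s_Interval], axis='columns')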

We end up with:


print(final_df.to_markdown())
|   Interval | FrameLen      | FrameCapLen   | Unnamed: 3   | Speed        | Delay      | Loss       | Interval   |   PacketTime |
|-----------:|:--------------|:--------------|:-------------|:-------------|:-----------|:-----------|:-----------|-------------:|
|          0 | [116.0, 66.0] | [116.0, 66.0] | [nan, nan]   | [25.0, 25.0] | [0.0, 0.0] | [0.0, 0.0] | [0.0, 0.0] |      2.8e-05 |
|          2 | [116.0, 66.0] | [116.0, 66.0] | [nan, nan]   | [25.0, 25.0] | [0.0, 0.0] | [0.0, 0.0] | [2.0, 2.0] |      2.6e-05 |
|          4 | [116.0]       | [116.0]       | [nan]        | [25.0]       | [0.0]      | [0.0]      | [4.0]      |      0       |
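
For comparison, both jobs can also be done in a single groupby pass by giving agg a dictionary that maps each column to its own function. This is a sketch of a different route to the same shape of result, not the method used above; the frame, time_span, recipe and final are hypothetical names, and the data is a cut-down copy of the question's df.head().

```python
import pandas as pd

# Hypothetical miniature of the question's data (values from df.head()).
df = pd.DataFrame({
    "PacketTime": [0.056078, 0.056106, 2.058089, 2.058115, 4.060316],
    "FrameLen": [116, 66, 116, 66, 116],
    "Speed": [25, 25, 25, 25, 25],
    "Interval": [0, 0, 2, 2, 4],
})


def time_span(s):
    """Return max - min of a Series (same idea as min_max above)."""
    return s.max() - s.min()


# Per-column recipe: collect into a list everywhere except PacketTime,
# which is reduced to its time span instead.
recipe = {col: list for col in df.columns if col not in ("Interval", "PacketTime")}
recipe["PacketTime"] = time_span

final = df.groupby("Interval").agg(recipe)
```

The result has one row per Interval, list-valued FrameLen and Speed columns, and a scalar PacketTime column, so no concat step is needed.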



