For this question, let's take the following example. I have a dataframe which looks as follows (df.head()):
Unnamed: 0 PacketTime FrameLen FrameCapLen ... Speed Delay Loss Interval
0 1 0.056078 116 116 ... 25 0 0 0
1 2 0.056106 66 66 ... 25 0 0 0
2 3 2.058089 116 116 ... 25 0 0 2
3 4 2.058115 66 66 ... 25 0 0 2
4 5 4.060316 116 116 ... 25 0 0 4
[5 rows x 23 columns]
As you can see, the rows are grouped by the Interval column. I know that pandas has df.groupby(colname), but what I wish to do is group the rows of each interval such that the column values are listed together. This would give an example output as follows:
Unnamed: 0 PacketTime FrameLen FrameCapLen ... Speed Delay Loss Interval
0 1 0.000028 116,66 116,66 ... 25,25 0,0 0,0 0
1 2 0.000026 116,66 116,66 ... 25,25 0,0 0,0 2
...
[5 rows x 23 columns]
As you can see, the desired end result is to have the column values collected into a list for each interval group, and the packet time combined such that the value is max(PacketTime)-min(PacketTime) for each interval group.
These are two separate tasks. For both, let's use the fact that a groupby operation does the following:
Split a single data frame into multiple data frames based on a single column. Apply an operation to each data frame. Stitch the resulting data frames together.
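To make the three steps concrete, here is a minimal sketch that performs them by hand and compares the result with the built-in call. The toy frame and the names toy, pieces, results, manual and builtin are made up for illustration; only the column names are borrowed from the question.

```python
import pandas as pd

# Hypothetical toy data modeled on two of the question's columns.
toy = pd.DataFrame({
    "FrameLen": [116, 66, 116, 66, 116],
    "Interval": [0, 0, 2, 2, 4],
})

# Split: iterating over a groupby yields (key, sub-frame) pairs.
pieces = {key: part for key, part in toy.groupby("Interval")}

# Apply: compute something for each sub-frame (here, the sum of FrameLen).
results = {key: part["FrameLen"].sum() for key, part in pieces.items()}

# Combine: stitch the per-group results back into a single object.
manual = pd.Series(results, name="FrameLen").rename_axis("Interval")

# The same three steps, done by groupby in one call.
builtin = toy.groupby("Interval")["FrameLen"].sum()
```

Both manual and builtin hold one value per interval; the rest of this answer only swaps the "apply" step for something other than a sum.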
First job:
Have a single line per interval for all columns other than PacketTime, where each value is a list of the values in that interval.
We want to collect the values into a list, so let's use series.to_list() for that. For a reason unknown to me, calling df.apply(lambda s: s.to_list()) won't work: pandas automatically converts the lists back to normal columns. However, calling this on rows returns what we want: a series of lists. Thus we will convert columns to rows, then apply to_list on the rows (which are the former columns).
Example
df.T.apply(lambda series: series.to_list(), axis='columns')
results in:
PacketTime [0.056078, 0.056106, 2.058089, 2.058115, 4.060...
FrameLen [116.0, 66.0, 116.0, 66.0, 116.0]
FrameCapLen [116.0, 66.0, 116.0, 66.0, 116.0]
Unnamed: 3 [nan, nan, nan, nan, nan]
Speed [25.0, 25.0, 25.0, 25.0, 25.0]
Delay [0.0, 0.0, 0.0, 0.0, 0.0]
Loss [0.0, 0.0, 0.0, 0.0, 0.0]
Interval [0.0, 0.0, 2.0, 2.0, 4.0]
This is exactly what we want for each Interval. So let's define it as a function and apply it to each interval then, right?!
import pandas as pd
df = pd.read_excel('example.xlsx')
def to_list(df):
    return df.T.apply(lambda x: x.to_list(), axis='columns')
df_other = df.groupby('Interval')\
    .apply(to_list)\
    .drop(columns='PacketTime')
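If the transpose feels roundabout, the same series of lists can probably be built directly from a dict comprehension. This is a sketch on made-up toy data, not the approach used in this answer; cols_to_lists is a hypothetical helper name, and the columns are selected explicitly so that the grouping column is not passed to the function.

```python
import pandas as pd

# Hypothetical toy data modeled on the question's columns.
toy = pd.DataFrame({
    "FrameLen": [116, 66, 116, 66, 116],
    "Speed": [25, 25, 25, 25, 25],
    "Interval": [0, 0, 2, 2, 4],
})

def cols_to_lists(frame):
    # Build a Series whose index is the column names and whose
    # values are plain Python lists of each column's contents.
    return pd.Series({col: frame[col].tolist() for col in frame.columns})

# Whole frame: one list per column.
whole = cols_to_lists(toy)

# Per interval: one row per group, one list per selected column.
per_group = toy.groupby("Interval")[["FrameLen", "Speed"]].apply(cols_to_lists)
```

Here per_group is indexed by Interval, with a list in every cell, which is the same shape as df_other above.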
Second job:
For calculating the duration, all we need is a function that takes the minimum and the maximum of the time and subtracts them to get the time span:
def min_max(s):
    return s.max() - s.min()
Now we just apply it and join the two dfs together:
s_Interval = df.groupby('Interval')['PacketTime']\
    .apply(min_max)
final_df = pd.concat([df_other, s_Interval], axis='columns')
We end up with:
print(final_df.to_markdown())
| Interval | FrameLen | FrameCapLen | Unnamed: 3 | Speed | Delay | Loss | Interval | PacketTime |
|-----------:|:--------------|:--------------|:-------------|:-------------|:-----------|:-----------|:-----------|-------------:|
| 0 | [116.0, 66.0] | [116.0, 66.0] | [nan, nan] | [25.0, 25.0] | [0.0, 0.0] | [0.0, 0.0] | [0.0, 0.0] | 2.8e-05 |
| 2 | [116.0, 66.0] | [116.0, 66.0] | [nan, nan] | [25.0, 25.0] | [0.0, 0.0] | [0.0, 0.0] | [2.0, 2.0] | 2.6e-05 |
| 4 | [116.0] | [116.0] | [nan] | [25.0] | [0.0] | [0.0] | [4.0] | 0 |
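As a possible shortcut, both jobs can be expressed in a single groupby(...).agg(...) call by passing a dict that maps each column to its own aggregation. This is a sketch on a made-up stand-in for the question's data, not the method used above; the names list_cols, spec and compact are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the question's data.
df = pd.DataFrame({
    "PacketTime": [0.056078, 0.056106, 2.058089, 2.058115, 4.060316],
    "FrameLen": [116, 66, 116, 66, 116],
    "Speed": [25, 25, 25, 25, 25],
    "Interval": [0, 0, 2, 2, 4],
})

# Every column except the key and PacketTime is collected into a list.
list_cols = [c for c in df.columns if c not in ("Interval", "PacketTime")]
spec = {col: list for col in list_cols}

# PacketTime is reduced to its span (max - min) within each interval.
spec["PacketTime"] = lambda s: s.max() - s.min()

compact = df.groupby("Interval").agg(spec)
print(compact)
```

Unlike final_df, the result has Interval only as the index rather than also as a column of lists. If comma-joined strings such as 116,66 are preferred over lists (as in the question's desired output), the list entry in spec could be swapped for a function like lambda s: ",".join(map(str, s)).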