
Merging rows in a pandas DataFrame

I have a pandas dataframe that looks like this:

import pandas as pd
df = pd.DataFrame([[0,10,0,'A','A',6,7],[11,21,1,'A','A',8,9],[0,13,1,'B','B',11,13],[0,12,1,'C','C',14,15],[13,14,0,'C','C',16,18]],columns=['Start Sample','End Sample','Value','Start Name','End Name','Start Time','End Time'])

df
Out[18]: 
   Start Sample  End Sample  Value Start Name End Name  Start Time  End Time
0             0          10      0          A        A           6         7
1            11          21      1          A        A           8         9
2             0          13      1          B        B          11        13
3             0          12      1          C        C          14        15
4            13          14      0          C        C          16        18

I would like to group consecutive rows having the same Value if the difference between the Start Time of row i+1 and the End Time of row i is < 3.

For example, rows 1, 2, 3 are consecutive rows having the same Value.

df['Start Time'].iloc[2] - df['End Time'].iloc[1] is = 2
df['Start Time'].iloc[3] - df['End Time'].iloc[2] is = 1

So they should all be merged. I would like these rows to become:

df2
Out[25]: 
   Start Sample  End Sample  Value Start Name End Name  Start Time  End Time
0             0          10      0          A        A           6         7
1            11          12      1          A        C           8        15
2            13          14      0          C        C          16        18

Please note that the new merged row should have:

1) Start Sample = the Start Sample of the first row merged
2) End Sample = the End Sample of the last row merged
3) Value = the common value
4) Start Name = the Start Name of the first row merged
5) End Name = the End Name of the last row merged
6) Start Time = the Start Time of the first row merged
7) End Time = the End Time of the last row merged

First some code for you to consider, then some explanation. The approach here is to break the dataframe into subsets based on your "Value" column and work on those sub-dataframes.

def agg(series):
    if series.name.startswith('Start'):
        return series.iloc[0]
    return series.iloc[-1]

subsets = [subset.apply(agg) for _, subset in 
             df.groupby((df['Value']!=df['Value'].shift(1)).cumsum())]

pd.concat(subsets, axis=1).T

The "tricky" part is (df['Value']!=df['Value'].shift(1)).cumsum(). The comparison flags every row where "Value" differs from the previous row, and cumsum() turns those flags into a running counter, so each run of consecutive equal values gets its own label. That label is what we group by.
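
To see what that expression produces, here is a minimal sketch on just the "Value" column from the question. The names `changed` and `group_id` are introduced here for illustration and do not appear in the answer.

```python
import pandas as pd

# The "Value" column from the question's dataframe
value = pd.Series([0, 1, 1, 1, 0], name="Value")

# True wherever the value differs from the previous row. The first row is
# always True because shift(1) leaves NaN in that position.
changed = value != value.shift(1)

# Running count of the change flags: the counter goes up by one at each
# change, so every run of equal values shares one label.
group_id = changed.cumsum()

print(changed.tolist())   # [True, True, False, False, True]
print(group_id.tolist())  # [1, 2, 2, 2, 3]
```

Rows 1, 2 and 3 share label 2, which is why they end up in the same subset.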

After the groupby, you are iterating through the sub-dataframes you are interested in. From here you can do a great many things, which is why this approach is flexible.

For each subset, apply calls the function on each series (column). In your case, you want one of two values depending on the column name, so a single function (agg here) can be applied to every series.

Edit: The change test above covers only one of the two criteria the OP specified. Including both is easy enough, but it extends the logic, so it should be broken out a little; that one-liner was already pushing the bounds of what is reasonable. The groupby condition should be:

val_chg = df['Value'] != df['Value'].shift(1)
time_chg = df['Start Time']-df['End Time'].shift(1) >=3

df.groupby((val_chg | time_chg).cumsum())
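
Putting the pieces together, the following is one possible end-to-end sketch. It swaps the custom agg function for the built-in 'first' and 'last' aggregations of groupby, which is a substitution made here rather than part of the answer above; the names `how` and `group_id` are illustrative. Note that 'first' and 'last' skip missing values, which does not matter for this data because it contains none.

```python
import pandas as pd

cols = ['Start Sample', 'End Sample', 'Value', 'Start Name', 'End Name',
        'Start Time', 'End Time']
df = pd.DataFrame([[0, 10, 0, 'A', 'A', 6, 7],
                   [11, 21, 1, 'A', 'A', 8, 9],
                   [0, 13, 1, 'B', 'B', 11, 13],
                   [0, 12, 1, 'C', 'C', 14, 15],
                   [13, 14, 0, 'C', 'C', 16, 18]], columns=cols)

# A new group starts when the value changes or the time gap is 3 or more
val_chg = df['Value'] != df['Value'].shift(1)
time_chg = df['Start Time'] - df['End Time'].shift(1) >= 3
group_id = (val_chg | time_chg).cumsum()

# "Start ..." columns take the first row of each group, all others the last
how = {c: 'first' if c.startswith('Start') else 'last' for c in cols}

df2 = df.groupby(group_id).agg(how).reset_index(drop=True)
print(df2)
```

For the sample data this yields the three rows the question asks for, with rows 1 to 3 collapsed into a single row running from Start Time 8 to End Time 15.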

There are probably better ways to do it, but here is an iterrows() approach:

df =pd.DataFrame([[0,10,0,'A','A',6,7],[11,21,1,'A','A',8,9],[0,13,1,'B','B',11,13],[0,12,1,'C','C',14,15],[13,14,0,'C','C',16,18]],columns=['Start Sample','End Sample','Value','Start Name','End Name','Start Time','End Time'])
df['keep'] = ''

active_row = None

for i, row in df.iterrows():
    if active_row is None:
        active_row = i
        df.loc[i,'keep'] = 1
        continue

    if row['Value'] != df.loc[active_row,'Value']:
        active_row = i
        df.loc[i,'keep'] = 1
        continue
    elif row['Start Time'] - df.loc[active_row,'End Time'] >= 3:
        active_row = i
        df.loc[i,'keep'] = 1
        continue

    df.loc[active_row,'End Time'] = row['End Time']
    df.loc[active_row,'End Sample'] = row['End Sample']
    df.loc[active_row,'End Name'] = row['End Name']
    df.loc[i,'keep'] = 0

final_df=df[df.keep == 1].drop('keep',axis=1)

It iterates through the rows, remembering the last meaningful row and updating it during the loop. Each iteration marks a row as keep (1) or drop (0), and that flag is used to filter the rows at the end.
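
The same idea can also be written without modifying the original dataframe. The sketch below is a variant, not the answer's code: the function name `merge_rows` and the parameter `max_gap` are hypothetical, and it assumes the merge rule from the question (same Value, gap below 3).

```python
import pandas as pd

def merge_rows(df, max_gap=3):
    """Loop-based merge; `merge_rows` and `max_gap` are illustrative names."""
    merged = []  # one dict per output row; the last one is the active row
    for row in df.to_dict('records'):
        last = merged[-1] if merged else None
        if (last is not None
                and row['Value'] == last['Value']
                and row['Start Time'] - last['End Time'] < max_gap):
            # Same value and small gap: extend the active row's "End" fields
            for col in ('End Sample', 'End Name', 'End Time'):
                last[col] = row[col]
        else:
            # Otherwise this row starts a new output row
            merged.append(dict(row))
    return pd.DataFrame(merged, columns=df.columns)

df = pd.DataFrame([[0, 10, 0, 'A', 'A', 6, 7],
                   [11, 21, 1, 'A', 'A', 8, 9],
                   [0, 13, 1, 'B', 'B', 11, 13],
                   [0, 12, 1, 'C', 'C', 14, 15],
                   [13, 14, 0, 'C', 'C', 16, 18]],
                  columns=['Start Sample', 'End Sample', 'Value', 'Start Name',
                           'End Name', 'Start Time', 'End Time'])
final_df = merge_rows(df)
print(final_df)
```

Building a list of plain dicts avoids the helper 'keep' column and leaves df unchanged, at the cost of rebuilding the dataframe at the end.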
