
Merge two rows in the same Dataframe if their index is the same?

I have created a large Dataframe by pulling data from an Azure database. The construction of the dataframe wasn't simple as I had to do it in parts, using the concat function to add new columns to the data set as they were pulled from the database.

This worked fine; however, I am indexing by entry date, and when concatenating I sometimes get two data rows with the same index. Is it possible to merge rows that share an index? I have searched online for solutions, but I always come across examples that merge two separate dataframes rather than rows within the same dataframe.

In summary:


                      Col1  Col2
2015-10-27 22:22:31   1400  
2015-10-27 22:22:31         50.5

To this

                      Col1  Col2
2015-10-27 22:22:31   1400  50.5

I have tried using the groupby function on the index, but that just made a mess: most of the data columns disappeared and a few very large numbers were spat out.


The data is in this sort of format, except with many more columns and is generally quite sparse!

                        Col1    Col2    ...    Col_n-1 Col_n    
2015-10-27 21:15:60+0   1220        
2015-10-27 21:25:4+0    1420        
2015-10-27 21:28:8+0    1410        
2015-10-27 21:37:10+0           51.5    
2015-10-27 21:37:11+0   1500        
2015-10-27 21:46:14+0           51  
2015-10-27 21:46:15+0   1390        
2015-10-27 21:55:19+0   1370        
2015-10-27 22:04:24+0   1450        
2015-10-27 22:13:28+0   1350        
2015-10-27 22:22:31+0   1400        
2015-10-27 22:22:31+0           50.5
2015-10-27 22:25:33+0   1300        
2015-10-27 22:29:42+0                   ...    1900 
2015-10-27 22:29:42+0                                  63       
2015-10-27 22:34:36+0   1280        

You can group by the index and call sum, for example df.groupby(level=0).sum() (taking df as the name of your dataframe), which gives:

                     Col1  Col2
2015-10-27 22:22:31  1400  50.5
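As a minimal, self-contained sketch of that approach, the snippet below rebuilds the two rows from the question and merges them. The names `df` and `merged` are placeholders, and it assumes the empty cells are NaN and the columns are numeric; the real dataframe may differ.

```python
import numpy as np
import pandas as pd

# Two rows sharing one timestamp, each filling a different column
# (assumption: the blank cells in the question are NaN).
idx = pd.to_datetime(["2015-10-27 22:22:31", "2015-10-27 22:22:31"])
df = pd.DataFrame(
    {"Col1": [1400.0, np.nan], "Col2": [np.nan, 50.5]},
    index=idx,
)

# level=0 groups on the index itself; sum skips NaN within each group.
merged = df.groupby(level=0).sum()
print(merged)
```

If the empty cells are actually empty strings or the columns have object dtype, the result can look quite different, so checking `df.dtypes` first is probably worthwhile.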

For anyone interested - I ended up writing my own function to:

  1. go through the dataframe
  2. note the indexes of the rows that need merging
  3. aggregate or average the values across all rows in each set
  4. delete all but one row of each set, replacing its values with the aggregates or averages (depending on what I needed)


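import numpy as np
import pandas as pd

def groupDataOnTimeBlock(data, timeBlock_type, timeBlock_factor):
    """
    Filter Dataframe to merge lines which are within the same time block,
    i.e. being part of the same x number of seconds, weeks, months...

    data:
        Dataframe to filter.

    timeBlock_type:
        Time period with which to group data rows. This can be data per
        millisecond, second, minute or week (the values accepted by the
        checks below).

    timeBlock_factor:
        Number of timeBlock types to group on.
    """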

    pd.options.mode.chained_assignment = None  # default='warn'

    tBt = timeBlock_type.upper()
    tBf = timeBlock_factor

    if tBt == 'SEC' or tBt == 'SECOND' or tBt == 'SECONDS':
        roundType = 'SECONDS'
    elif tBt == 'MINS' or tBt == 'MINUTES' or tBt == 'MIN':
        roundType = 'MINUTES'
    elif tBt == 'MILLI' or tBt == 'MILLISECONDS':
        roundType = 'MILLISECONDS'
    elif tBt == 'WEEK' or tBt == 'WEEKS':
        roundType = 'WEEKS'
    else:
        raise ValueError('Invalid time block type entered')

    numElements = len(data.columns)
    # timeStampReformat is a helper that is not shown in this answer
    anchorValue = timeStampReformat(data.iloc[1,len(data.columns)-7], roundType, tBf)
    delIndex = []
    mergeCount = 0
    av_agg_arr = np.zeros([1,numElements], dtype=float)

    #Cycling through dataframe to get averages and note which rows to delete
    for i, row in data.iterrows(): #i is the index value, from 0
        backDate = timeStampReformat(row['Timestamp'], roundType, tBf)
        data.loc[i,'Timestamp'] = backDate #can be done better. Not all rows need updating.

        if (backDate > anchorValue): #if data should be grouped
            delIndex.pop() #remove last index as this is the final row to use
            delIndex.append(i) #add current row so that it isnt missed.
            if mergeCount != 0:
                av_agg_arr = av_agg_arr/mergeCount
                for idx in range(1,numElements-1):
                    if isinstance(row.values[idx],float):
                        data.iloc[i-1, idx] = av_agg_arr[0, idx] #configure previous (index i -1) row. This is the last of the prior datetime group

            anchorValue = backDate
            mergeCount = 0

            # Re-initialising aggregates and passing in current row values.
            av_agg_arr = av_agg_arr - av_agg_arr 
            for idx in range(1,numElements-1):
                if isinstance(row.values[idx],float):
                    if not pd.isnull(row.values[idx]):
                        av_agg_arr[0,idx] += row.values[idx]
        else: #else if data is still part of same datetime group
            for idx in range(1,numElements-1):
                if isinstance(row.values[idx],float):
                    if not pd.isnull(row.values[idx]):
                        av_agg_arr[0,idx] += row.values[idx]
            mergeCount += 1
            delIndex.append(i) #picking out index value of row

    data.drop(data.index[delIndex], inplace=True) #delete all flagged rows

    pd.options.mode.chained_assignment = 'warn'  # default='warn'
    return data
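For comparison, pandas built-ins can do a similar kind of time-block averaging. This is a different technique from the function above, not a port of it: it floors each timestamp to the start of its block and averages per block. The sample values come from the question's table; the 10-minute block size and the names `df`, `blocks` and `binned` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Col1": [1500.0, np.nan, 1390.0], "Col2": [np.nan, 51.0, np.nan]},
    index=pd.to_datetime(
        ["2015-10-27 21:37:11", "2015-10-27 21:46:14", "2015-10-27 21:46:15"]
    ),
)

# Floor every timestamp to the start of its 10-minute block (assumed size).
blocks = df.index.floor("10min")

# Average all rows that fall into the same block; NaN values are skipped.
binned = df.groupby(blocks).mean()
print(binned)
```

Swapping `.mean()` for `.sum()` gives aggregates instead of averages, which roughly matches the "aggregate or average" choice described in the list above.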

Building on @EdChum's answer, you can also use the min_count parameter of GroupBy.sum to control how NaN values are handled. Say the example has an additional row:

                      Col1  Col2
2015-10-27 22:22:31   1400   NaN
2015-10-27 22:22:31    NaN  50.5
2022-08-02 16:00:00   1600   NaN


Calling df.groupby(level=0).sum(min_count=1) on it gives:

                     Col1  Col2
2015-10-27 22:22:31  1400  50.5
2022-08-02 16:00:00  1600   NaN

With min_count=0 (the default), the sum of an all-NaN group is 0 rather than NaN.
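A small sketch comparing the two settings on the three-row example; the variable names `df`, `keep_nan` and `fill_zero` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Col1": [1400.0, np.nan, 1600.0], "Col2": [np.nan, 50.5, np.nan]},
    index=pd.to_datetime(
        ["2015-10-27 22:22:31", "2015-10-27 22:22:31", "2022-08-02 16:00:00"]
    ),
)

# min_count=1: a group needs at least one non-NaN value, otherwise NaN.
keep_nan = df.groupby(level=0).sum(min_count=1)

# min_count=0: a group with no valid values sums to 0.
fill_zero = df.groupby(level=0).sum(min_count=0)

print(keep_nan)
print(fill_zero)
```

Keeping NaN is likely the safer choice for sparse data like this, since a 0 would be indistinguishable from a genuine zero reading.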
