
Merge two rows in the same Dataframe if their index is the same?

I have created a large Dataframe by pulling data from an Azure database. The construction of the dataframe wasn't simple, as I had to do it in parts, using the concat function to add new columns to the data set as they were pulled from the database.

This worked fine; however, I am indexing by entry date, and when concatenating I sometimes get two data rows with the same index. Is it possible for me to merge lines with the same index? I have searched online for solutions, but I always come across examples trying to merge two separate dataframes instead of merging rows within the same dataframe.
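For concreteness, here is a minimal sketch (values taken from the example below; the exact construction is guessed) of how stacking partial pulls with concat can leave two rows sharing the same timestamp:

import pandas as pd

# Two partial pulls from the database, both indexed by entry date.
idx = pd.to_datetime(['2015-10-27 22:22:31'])
part1 = pd.DataFrame({'Col1': [1400]}, index=idx)
part2 = pd.DataFrame({'Col2': [50.5]}, index=idx)

# Stacking the pieces adds the new column but repeats the timestamp:
df = pd.concat([part1, part2])
print(df)
#                        Col1  Col2
# 2015-10-27 22:22:31  1400.0   NaN
# 2015-10-27 22:22:31     NaN  50.5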

In summary:

This

                      Col1  Col2
2015-10-27 22:22:31   1400  
2015-10-27 22:22:31         50.5

To this

                      Col1  Col2
2015-10-27 22:22:31   1400  50.5

I have tried using the groupby function on the index, but that just made a mess: most of the data columns disappeared and a few very large numbers were spat out.

Note:

The data is in this sort of format, except with many more columns, and it is generally quite sparse!

                        Col1    Col2    ...    Col_n-1 Col_n    
2015-10-27 21:15:60+0   1220        
2015-10-27 21:25:4+0    1420        
2015-10-27 21:28:8+0    1410        
2015-10-27 21:37:10+0           51.5    
2015-10-27 21:37:11+0   1500        
2015-10-27 21:46:14+0           51  
2015-10-27 21:46:15+0   1390        
2015-10-27 21:55:19+0   1370        
2015-10-27 22:04:24+0   1450        
2015-10-27 22:13:28+0   1350        
2015-10-27 22:22:31+0   1400        
2015-10-27 22:22:31+0           50.5
2015-10-27 22:25:33+0   1300        
2015-10-27 22:29:42+0                   ...    1900 
2015-10-27 22:29:42+0                                  63       
2015-10-27 22:34:36+0   1280        

You can groupby on your index and call sum:

In [184]:
df.groupby(level=0).sum()

Out[184]:
                     Col1  Col2
index                          
2015-10-27 22:22:31  1400  50.5
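A self-contained version of the same call, with the frame built from the example values, in case anyone wants to run it:

import pandas as pd

idx = pd.to_datetime(['2015-10-27 22:22:31', '2015-10-27 22:22:31'])
df = pd.DataFrame({'Col1': [1400, None], 'Col2': [None, 50.5]}, index=idx)

# sum() skips NaN within each group, so the two sparse rows collapse into one.
print(df.groupby(level=0).sum())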

For anyone interested - I ended up writing my own function to:

  1. go through the dataframe
  2. take note of rows that need merging by recording their indexes
  3. aggregate or average the values across all rows in each set
  4. delete all but one row of each set that needed merging, replacing that row's values with the aggregates or averages (depending on what I needed)

code:

import numpy as np
import pandas as pd

def groupDataOnTimeBlock(data, timeBlock_type, timeBlock_factor):
    '''
    Filter Dataframe to merge lines which are within the same time block,
    i.e. rows belonging to the same x number of seconds, minutes, weeks...

    data:
        Dataframe to filter. Must have a 'Timestamp' column.

    timeBlock_type:
        Time period with which to group data rows. This can be data per:
            SECONDS, MINUTES, MILLISECONDS, WEEKS

    timeBlock_factor:
        Number of timeBlock types to group on.
    '''

    pd.options.mode.chained_assignment = None  # default='warn'

    tBt = timeBlock_type.upper()
    tBf = timeBlock_factor

    if tBt in ('SEC', 'SECOND', 'SECONDS'):
        roundType = 'SECONDS'
    elif tBt in ('MIN', 'MINS', 'MINUTES'):
        roundType = 'MINUTES'
    elif tBt in ('MILLI', 'MILLISECONDS'):
        roundType = 'MILLISECONDS'
    elif tBt in ('WEEK', 'WEEKS'):
        roundType = 'WEEKS'
    else:
        raise ValueError('Invalid time block type entered')

    numElements = len(data.columns)
    # timeStampReformat floors a timestamp to the start of its time block;
    # the column offset here matches my particular data layout.
    anchorValue = timeStampReformat(data.iloc[1, len(data.columns) - 7], roundType, tBf)
    delIndex = []
    mergeCount = 0
    av_agg_arr = np.zeros([1, numElements], dtype=float)

    # Cycle through the dataframe to compute averages and note which rows to delete
    for i, row in data.iterrows():  # i is the index value, from 0
        backDate = timeStampReformat(row['Timestamp'], roundType, tBf)
        data.loc[i, 'Timestamp'] = backDate  # could be done better: not all rows need updating

        if backDate > anchorValue:  # a new time block starts here
            if delIndex:
                delIndex.pop()  # keep the last row of the finished block as its merged row
            delIndex.append(i)  # flag the current row so it isn't missed
            print('collate')  # debug trace for each block boundary
            if mergeCount != 0:
                av_agg_arr = av_agg_arr / mergeCount
                for idx in range(1, numElements - 1):
                    if isinstance(row.values[idx], float):
                        # write the averages into the previous row (i - 1),
                        # the last row of the prior datetime group
                        data.iloc[i - 1, idx] = av_agg_arr[0, idx]

            anchorValue = backDate
            mergeCount = 0

            # Re-initialise the aggregates and pass in the current row's values
            av_agg_arr[:] = 0
            for idx in range(1, numElements - 1):
                if isinstance(row.values[idx], float) and not pd.isnull(row.values[idx]):
                    av_agg_arr[0, idx] += row.values[idx]
        else:  # the row is still part of the same datetime group
            for idx in range(1, numElements - 1):
                if isinstance(row.values[idx], float) and not pd.isnull(row.values[idx]):
                    av_agg_arr[0, idx] += row.values[idx]
            mergeCount += 1
            delIndex.append(i)  # flag this row for deletion

    data.drop(data.index[delIndex], inplace=True)  # delete all flagged rows
    data.reset_index(drop=True, inplace=True)

    pd.options.mode.chained_assignment = 'warn'  # restore the default
    return data
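The timeStampReformat helper is not shown in the original post. A minimal stand-in, assuming it simply floors a timestamp to the start of its time block (the name is the author's, but this body is a guess, not their code):

import pandas as pd

def timeStampReformat(ts, roundType, factor):
    # Hypothetical stand-in for the undefined helper: floor `ts` to the
    # start of its `factor`-sized time block.
    units = {'MILLISECONDS': 'ms', 'SECONDS': 's', 'MINUTES': 'min', 'WEEKS': 'D'}
    # Timestamp.floor needs fixed frequencies, so express weeks in days.
    n = factor * 7 if roundType == 'WEEKS' else factor
    return pd.Timestamp(ts).floor(f'{n}{units[roundType]}')

# Example call, assuming the frame carries its timestamps in a 'Timestamp' column:
# merged = groupDataOnTimeBlock(df, 'SECONDS', 10)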

Building on @EdChum's answer, it is also possible to use the min_count parameter of GroupBy.sum to manage NaN values in different ways. Let's say we add one more row to the example:

                      Col1  Col2
2015-10-27 22:22:31   1400   NaN
2015-10-27 22:22:31    NaN  50.5
2022-08-02 16:00:00   1600   NaN

then,

In [184]:
df.groupby(level=0).sum(min_count=1)

Out[184]:
                     Col1  Col2
index                          
2015-10-27 22:22:31  1400  50.5
2022-08-02 16:00:00  1600   NaN

Using min_count=0 (the default) will output 0 instead of NaN for groups with no valid values.
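A quick sketch of the difference, with the frame built from the three rows above:

import pandas as pd

idx = pd.to_datetime(['2015-10-27 22:22:31', '2015-10-27 22:22:31',
                      '2022-08-02 16:00:00'])
df = pd.DataFrame({'Col1': [1400, None, 1600], 'Col2': [None, 50.5, None]},
                  index=idx)

# min_count=1: a group with no valid values in a column yields NaN there
print(df.groupby(level=0).sum(min_count=1))
# min_count=0 (the default): the same cells yield 0 instead
print(df.groupby(level=0).sum(min_count=0))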
