Merge two rows in the same Dataframe if their index is the same?
I have created a large Dataframe by pulling data from an Azure database. The construction of the dataframe wasn't simple, as I had to build it in parts, using the concat function to add new columns to the data set as they were pulled from the database.
This worked fine; however, I am indexing by entry date, and when concatenating I sometimes get two data rows with the same index. Is it possible for me to merge rows with the same index? I have searched online for solutions, but I always come across examples that merge two separate dataframes rather than merging rows within the same dataframe.
Current:
                     Col1  Col2
2015-10-27 22:22:31  1400
2015-10-27 22:22:31        50.5

Desired:
                     Col1  Col2
2015-10-27 22:22:31  1400  50.5
I have tried using the groupby function on the index, but that just made a mess: most of the data columns disappeared and a few very large numbers were spat out.
The data is in roughly this format, except with many more columns, and it is generally quite sparse:
Col1 Col2 ... Col_n-1 Col_n
2015-10-27 21:15:60+0 1220
2015-10-27 21:25:4+0 1420
2015-10-27 21:28:8+0 1410
2015-10-27 21:37:10+0 51.5
2015-10-27 21:37:11+0 1500
2015-10-27 21:46:14+0 51
2015-10-27 21:46:15+0 1390
2015-10-27 21:55:19+0 1370
2015-10-27 22:04:24+0 1450
2015-10-27 22:13:28+0 1350
2015-10-27 22:22:31+0 1400
2015-10-27 22:22:31+0 50.5
2015-10-27 22:25:33+0 1300
2015-10-27 22:29:42+0 ... 1900
2015-10-27 22:29:42+0 63
2015-10-27 22:34:36+0 1280
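A minimal reproducible version of the duplicate-index situation, built from the small two-row example above (column names are illustrative), together with one way to collapse the rows:

```python
import pandas as pd
import numpy as np

# Two rows sharing the same timestamp index, each holding part of the data
# (mirrors the small example above).
idx = pd.to_datetime(['2015-10-27 22:22:31', '2015-10-27 22:22:31'])
df = pd.DataFrame({'Col1': [1400, np.nan], 'Col2': [np.nan, 50.5]}, index=idx)

# Collapse duplicate-index rows into one; min_count=1 keeps an all-NaN
# group as NaN instead of summing it to 0.
merged = df.groupby(level=0).sum(min_count=1)
print(merged)
```

Because each column has at most one non-NaN value per timestamp, summing is equivalent to picking the populated value.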
For anyone interested, I ended up writing my own function to do this.
code:
import pandas as pd
import numpy as np

def groupDataOnTimeBlock(data, timeBlock_type, timeBlock_factor):
    '''
    Filter Dataframe to merge lines which are within the same time block,
    i.e. rows belonging to the same x number of seconds, weeks, months...

    data:
        Dataframe to filter.
    timeBlock_type:
        Time period with which to group data rows. This can be data per:
        SECONDS, MINUTES, MILLISECONDS, WEEKS.
    timeBlock_factor:
        Number of timeBlock types to group on.
    '''
    pd.options.mode.chained_assignment = None  # default='warn'
    tBt = timeBlock_type.upper()
    tBf = timeBlock_factor
    if tBt in ('SEC', 'SECOND', 'SECONDS'):
        roundType = 'SECONDS'
    elif tBt in ('MIN', 'MINS', 'MINUTES'):
        roundType = 'MINUTES'
    elif tBt in ('MILLI', 'MILLISECONDS'):
        roundType = 'MILLISECONDS'
    elif tBt in ('WEEK', 'WEEKS'):
        roundType = 'WEEKS'
    else:
        raise ValueError('Invalid time block type entered')

    numElements = len(data.columns)
    # timeStampReformat is a separate helper (not shown) that rounds a
    # timestamp down to the start of its time block.
    anchorValue = timeStampReformat(data.iloc[1, len(data.columns) - 7], roundType, tBf)
    delIndex = []
    mergeCount = 0
    av_agg_arr = np.zeros([1, numElements], dtype=float)

    # Cycle through the dataframe to compute averages and note which rows to delete.
    for i, row in data.iterrows():  # i is the index value, from 0
        backDate = timeStampReformat(row['Timestamp'], roundType, tBf)
        data.loc[i, 'Timestamp'] = backDate  # could be done better: not all rows need updating
        if backDate > anchorValue:  # start of a new time block
            delIndex.pop()          # remove last index, as that is the final row to keep
            delIndex.append(i)      # add the current row so that it isn't missed
            print('collate')
            if mergeCount != 0:
                av_agg_arr = av_agg_arr / mergeCount
            for idx in range(1, numElements - 1):
                if isinstance(row.values[idx], float):
                    # Write the averages into the previous row (i - 1),
                    # the last row of the prior datetime group.
                    data.iloc[i - 1, idx] = av_agg_arr[0, idx]
            anchorValue = backDate
            mergeCount = 0
            # Re-initialise aggregates and pass in current row values.
            av_agg_arr = av_agg_arr - av_agg_arr
            for idx in range(1, numElements - 1):
                if isinstance(row.values[idx], float) and not pd.isnull(row.values[idx]):
                    av_agg_arr[0, idx] += row.values[idx]
        else:  # row is still part of the same datetime group
            for idx in range(1, numElements - 1):
                if isinstance(row.values[idx], float) and not pd.isnull(row.values[idx]):
                    av_agg_arr[0, idx] += row.values[idx]
            mergeCount += 1
            delIndex.append(i)  # flag this row for deletion

    data.drop(data.index[delIndex], inplace=True)  # delete all flagged rows
    data.reset_index(inplace=True)
    pd.options.mode.chained_assignment = 'warn'  # restore default
    return data
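Much of what the function above does by hand (rounding timestamps into fixed blocks and averaging the sparse values within each block) can also be sketched with pandas' own Grouper. This is not the OP's code, just an illustration of the same idea on a small made-up frame:

```python
import pandas as pd
import numpy as np

# Sparse data with near-duplicate timestamps, loosely modelled on the question.
df = pd.DataFrame(
    {'Col1': [1400.0, np.nan, 1300.0], 'Col2': [np.nan, 50.5, np.nan]},
    index=pd.to_datetime(['2015-10-27 22:22:31',
                          '2015-10-27 22:22:31',
                          '2015-10-27 22:25:33']),
)

# Group rows into 10-second blocks and average within each block;
# mean() skips NaN, so sparse columns merge rather than cancel out.
blocked = df.groupby(pd.Grouper(freq='10s')).mean().dropna(how='all')
print(blocked)
```

The dropna(how='all') call discards the empty 10-second blocks between observations, which a resample would otherwise emit.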
Building on @EdChum's answer, it is also possible to use the min_count parameter of groupby.sum to handle NaN values in different ways. Let's say we add an additional row to the example:
                     Col1  Col2
2015-10-27 22:22:31  1400   NaN
2015-10-27 22:22:31   NaN  50.5
2022-08-02 16:00:00  1600   NaN
then,
In [184]:
df.groupby('index').sum(min_count=1)
Out[184]:
Col1 Col2
index
2015-10-27 22:22:31 1400 50.5
2022-08-02 16:00:00 1600 NaN
Using min_count=0 will output 0 instead of NaN values.
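A quick sketch of the difference, with the toy data rebuilt inline so the snippet stands alone:

```python
import pandas as pd
import numpy as np

idx = pd.to_datetime(['2015-10-27 22:22:31', '2015-10-27 22:22:31',
                      '2022-08-02 16:00:00'])
df = pd.DataFrame({'Col1': [1400, np.nan, 1600],
                   'Col2': [np.nan, 50.5, np.nan]}, index=idx)

# min_count=1: a group whose values are all NaN stays NaN.
with_nan = df.groupby(level=0).sum(min_count=1)

# min_count=0 (the default): an all-NaN group sums to 0 instead.
with_zero = df.groupby(level=0).sum(min_count=0)
print(with_nan)
print(with_zero)
```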