Merging certain rows in pandas dataframe

I have this dataframe, consisting of 73 rows:

Date    Col1    Col2   Col3
1975   float   float  float
1976   float   float  float
1976   float   float  float
1977   float   float  float
1978   float   float  float
....
....

Certain years appear twice because the values were measured twice that year. What I want to do is merge the rows where the year is the same, taking the mean of each column over those two rows. I am still getting familiar with pandas and I don't really understand how to use the loc and iloc selectors. This is what I have tried, but I am sure it is completely wrong and non-Pythonic:

for i in range(72):
    if df.Date[i]==df.Date[i+1]:
        df.Very_satisfied[i]= (df.Very_satisfied[i]+df.Very_satisfied[i+1])/2
        df.Fairly_satisfied[i]= (df.Fairly_satisfied[i]+df.Fairly_satisfied[i+1])/2
        df.NV_satisfied[i]= (df.NV_satisfied[i]+ df.NV_satisfied[i+1])/2
        df.Not_satisfied[i]= (df.Not_satisfied[i]+ df.Not_satisfied[i+1])/2
        df.DK[i]= (df.DK[i]+ df.DK[i+1])/2
        a=i+1
        str(a)
        df.drop(a)

where Very_satisfied, Fairly_satisfied, etc. are the columns. The point of my code is: if two consecutive rows have the same year, calculate the mean of each value, write it into the first row and delete the second row. I really need something smarter and more elegant.

You can use groupby() and then mean() for this. Here is an example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'date': list(range(25)) * 2, 'col1': np.random.random(50) * 100, 'col2': np.random.random(50)})
df.groupby('date').mean()

This groups all the rows that share the same date and calculates, for each column, the mean of the rows in each group. The date values become the index of the result.
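Applied to data shaped like the question's, the same idea might look like the sketch below. The column names Date, Col1 and Col2 come from the question's placeholder table, and the numeric values are made up for illustration. Passing as_index=False is optional; it keeps Date as an ordinary column rather than moving it into the index.

```python
import pandas as pd

# Hypothetical values: 1976 appears twice, as in the question.
df = pd.DataFrame({
    "Date": [1975, 1976, 1976, 1977],
    "Col1": [10.0, 20.0, 30.0, 40.0],
    "Col2": [1.0, 2.0, 4.0, 8.0],
})

# One row per year; duplicated years are replaced by their column means.
# as_index=False keeps Date as a regular column instead of the index.
merged = df.groupby("Date", as_index=False).mean()
print(merged)
```

Years that appear only once pass through unchanged, since the mean of a single value is that value, so no loop or explicit row deletion is needed.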

On my sample, this outputs:

df.groupby('date').mean().head()
           col1      col2
date
0     42.881950  0.436073
1     32.114299  0.309742
2     96.819446  0.809071
3     30.606661  0.284257
4     40.690211  0.624972

For this input:

df[df['date'] < 5]

    date       col1      col2
0      0  67.268605  0.393560
1      1  55.864578  0.508636
2      2  97.735942  0.861162
3      3  58.014599  0.117055
4      4   7.429489  0.637101
25     0  18.495296  0.478585
26     1   8.364020  0.110848
27     2  95.902950  0.756980
28     3   3.198724  0.451460
29     4  73.950932  0.612843
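As a quick check, the ten input rows printed above can be typed in by hand and grouped again. This is only a sketch for verification: the variable names sample and means are mine, and the numbers are copied from the printed (rounded) output, so the results agree with the table of means only up to rounding.

```python
import pandas as pd

# The ten rows shown above, with their original index labels.
sample = pd.DataFrame(
    {
        "date": [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
        "col1": [67.268605, 55.864578, 97.735942, 58.014599, 7.429489,
                 18.495296, 8.364020, 95.902950, 3.198724, 73.950932],
        "col2": [0.393560, 0.508636, 0.861162, 0.117055, 0.637101,
                 0.478585, 0.110848, 0.756980, 0.451460, 0.612843],
    },
    index=[0, 1, 2, 3, 4, 25, 26, 27, 28, 29],
)

# Each date occurs twice, so each mean is the average of two rows,
# e.g. date 0, col1: (67.268605 + 18.495296) / 2
means = sample.groupby("date").mean()
print(means)
```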
