Pandas/Pythonic way to group by column X and, within each group, return the value in column Y based on the value in column Z

Reproducible example:

import pandas as pd

df = pd.DataFrame([[1, '2015-12-15', 10],
                   [1, '2015-12-16', 13], 
                   [1, '2015-12-17', 16], 
                   [2, '2015-12-15', 19],
                   [2, '2015-12-11', 22], 
                   [2, '2015-12-18', 25],
                   [3, '2015-12-14', 28], 
                   [3, '2015-12-12', 31], 
                   [3, '2015-12-15', 34]])

df.columns = ['X', 'Y', 'Z']
print(df.dtypes)
print()
print(df)

The output of the reproducible example and the datatype of each column:

X     int64
Y    object
Z     int64
dtype: object

   X           Y   Z
0  1  2015-12-15  10
1  1  2015-12-16  13
2  1  2015-12-17  16
3  2  2015-12-15  19
4  2  2015-12-11  22
5  2  2015-12-18  25
6  3  2015-12-14  28
7  3  2015-12-12  31
8  3  2015-12-15  34

Expected output:

   X           Y   Z
0  1  2015-12-15  10
1  1  2015-12-15  10
2  2  2015-12-11  22
3  2  2015-12-15  19
4  3  2015-12-12  31
5  3  2015-12-15  34

Explanation of that output:

For every group in column X (after grouping by X), I want one row holding the value in column Z where column Y equals the minimum of all dates (stored as objects) in column Y for that group, and a second row for the same group holding the value in column Z where column Y equals a custom, hard-coded date that is guaranteed to exist in every group. So every group would have two rows.

In my output, for group 1 the value in column Z is 10, because 10 is the Z value associated with the minimum date in column Y for group 1, which is 2015-12-15. The second row for group 1 is the row for the custom date 2015-12-15, whose Z value is also 10. For group 2, the minimum date in column Y is 2015-12-11, and the corresponding Z value is 22. For the custom date 2015-12-15, the Z value in group 2 is 19.

Here is the naive code I wrote to accomplish this, which I assume amounts to a linear-time search:

from collections import Counter

uniqueXs = list(dict(Counter(df['X'].tolist())).keys()) #Get every unique item in column X as a list.
df_list = [] #Empty list that will have rows of my final DataFrame

for x in uniqueXs: #Iterate through each unique value in column X

    idfiltered_dataframe = df.loc[df['X'] == x] #Filter DataFrame based on the current value in column X 
                                                #(iterating through list of all values)

    min_date = min(idfiltered_dataframe['Y']) #Min of column Y
    custom_date = '2015-12-15' #Every group WILL have this custom date.

    mindatefiltered_dataframe = idfiltered_dataframe.loc[idfiltered_dataframe['Y'] == min_date] #Within group, filter rows where column Y has minimum date
    customdatefiltered_dataframe = idfiltered_dataframe.loc[idfiltered_dataframe['Y'] == custom_date]  #Within group, filter rows where column Y has a custom date

    for row_1 in mindatefiltered_dataframe.index: #Iterate through mindatefiltered DataFrame and create list of each row value required

        row_list = [mindatefiltered_dataframe.at[row_1, 'X'], mindatefiltered_dataframe.at[row_1, 'Y'], mindatefiltered_dataframe.at[row_1, 'Z']]
        df_list.append(row_list) #Append to a master list

    for row_2 in customdatefiltered_dataframe.index: #Iterate through customdatefiltered DataFrame and create list of each row value required

        row_list = [customdatefiltered_dataframe.at[row_2, 'X'], customdatefiltered_dataframe.at[row_2, 'Y'], customdatefiltered_dataframe.at[row_2, 'Z']]
        df_list.append(row_list) #Append to a master list



print(pd.DataFrame(df_list)) #Create DataFrame out of the master list

I'm under the impression that there is some slick way where you just do df.groupby.. and get the expected output, and I'm hoping someone could provide the code to do that.

IIUC:

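df.Y = pd.to_datetime(df.Y)  # convert to datetime first so the dates sort and compare correctly
# g1: True for each date that appears in every group (count the groups per date via value_counts);
# groupby(level=1).count() replaces count(level=1), which newer pandas versions no longer accept
g1 = df.groupby('X').Y.value_counts().groupby(level=1).count().eq(df.X.nunique())
g2 = df.sort_values('Y').groupby('X').head(1)  # the row with the minimum date in each group

pd.concat([df.loc[df.Y.isin(g1[g1].index)], g2]).sort_index()  # combine both sets of rows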
Out[280]: 
   X          Y   Z
0  1 2015-12-15  10
0  1 2015-12-15  10
3  2 2015-12-15  19
4  2 2015-12-11  22
7  3 2015-12-12  31
8  3 2015-12-15  34
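
To make the g1 step easier to follow, here is a minimal sketch that computes the same set of "dates present in every group" in separate steps. It uses a slightly different but equivalent formulation (grouping by Y and counting distinct X values) rather than the chained value_counts call above. The names groups_per_date, n_groups and common_dates are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, '2015-12-15', 10], [1, '2015-12-16', 13], [1, '2015-12-17', 16],
     [2, '2015-12-15', 19], [2, '2015-12-11', 22], [2, '2015-12-18', 25],
     [3, '2015-12-14', 28], [3, '2015-12-12', 31], [3, '2015-12-15', 34]],
    columns=['X', 'Y', 'Z'],
)
df['Y'] = pd.to_datetime(df['Y'])

# For each date, count how many distinct groups (X values) contain it.
groups_per_date = df.groupby('Y')['X'].nunique()

# Keep only the dates that occur in every group.
n_groups = df['X'].nunique()
common_dates = groups_per_date.index[groups_per_date == n_groups]

print(list(common_dates.strftime('%Y-%m-%d')))  # ['2015-12-15']
```

Note that this approach derives the shared date from the data instead of hard-coding it, so it only matches the question's requirement when exactly one date is common to all groups.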

Use:

import datetime as dt
import pandas as pd

date_fill = dt.datetime.strptime('2015-12-15', '%Y-%m-%d')
df['Y'] = pd.to_datetime(df['Y'], format='%Y-%m-%d')

df_g = df.loc[df.groupby(['X'])['Y'].idxmin()]
df2 = df[df['Y']==date_fill]
target_map = pd.Series(df2['Z'].tolist(),index=df2['X']).to_dict()
df_g.index = range(1, 2*len(df_g)+1, 2)
df_g = df_g.reindex(index=range(2*len(df_g)))
df_g['Y'] = df_g['Y'].fillna(date_fill)
df_g = df_g.bfill()
df_g.loc[df_g['Y']==date_fill, 'Z'] = df_g[df_g['Y']==date_fill]['X'].map(target_map)
df_g = df_g.bfill()
print(df_g)

Output

     X          Y     Z
0  1.0 2015-12-15  10.0
1  1.0 2015-12-15  10.0
2  2.0 2015-12-15  19.0
3  2.0 2015-12-11  22.0
4  3.0 2015-12-15  34.0
5  3.0 2015-12-12  31.0

Explanation

  1. Put the desired custom date in date_fill.
  2. df.groupby(['X'])['Y'].idxmin() returns the index label of the row with the minimum Y in each group; df.loc then selects those rows.
  3. target_map is a dict mapping each X to the Z value on the custom date, created so those Z values can be restored later.
  4. Next, df_g is expanded so that every alternate row holds NA values.
  5. df_g = df_g.bfill() is called twice in case you enter a date in date_fill that isn't present in the df. In that case target_map won't be populated and you will end up with NA values.
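
As a rough illustration of steps 2 and 3, the sketch below runs them on the question's data and prints the intermediate results. It uses pd.Timestamp in place of dt.datetime.strptime purely for brevity, and builds the dict with zip instead of an intermediate Series. min_idx and custom_rows are illustrative names that do not appear in the code above.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, '2015-12-15', 10], [1, '2015-12-16', 13], [1, '2015-12-17', 16],
     [2, '2015-12-15', 19], [2, '2015-12-11', 22], [2, '2015-12-18', 25],
     [3, '2015-12-14', 28], [3, '2015-12-12', 31], [3, '2015-12-15', 34]],
    columns=['X', 'Y', 'Z'],
)
df['Y'] = pd.to_datetime(df['Y'], format='%Y-%m-%d')
date_fill = pd.Timestamp('2015-12-15')

# Step 2: index label of the earliest-date row within each X group.
min_idx = df.groupby('X')['Y'].idxmin()
print(min_idx.tolist())  # [0, 4, 7]

# Step 3: map each X to the Z value found on the custom date.
custom_rows = df[df['Y'] == date_fill]
target_map = dict(zip(custom_rows['X'].tolist(), custom_rows['Z'].tolist()))
print(target_map)  # {1: 10, 2: 19, 3: 34}
```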

I am sure this can be optimized somewhat, but the thought process should help you proceed.
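
One possible simplification, offered as a sketch rather than a definitive answer: select the minimum-date rows and the custom-date rows separately, then concatenate and sort them. This assumes, as the question states, that the custom date exists in every group. The names custom_date, min_rows, custom_rows and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, '2015-12-15', 10], [1, '2015-12-16', 13], [1, '2015-12-17', 16],
     [2, '2015-12-15', 19], [2, '2015-12-11', 22], [2, '2015-12-18', 25],
     [3, '2015-12-14', 28], [3, '2015-12-12', 31], [3, '2015-12-15', 34]],
    columns=['X', 'Y', 'Z'],
)
df['Y'] = pd.to_datetime(df['Y'])
custom_date = pd.Timestamp('2015-12-15')  # assumed to exist in every group

# One row per group: the row holding that group's earliest date.
min_rows = df.loc[df.groupby('X')['Y'].idxmin()]

# One row per group: the row holding the hard-coded custom date.
custom_rows = df[df['Y'] == custom_date]

# Stack both selections and order them by group, then by date.
result = (
    pd.concat([min_rows, custom_rows])
    .sort_values(['X', 'Y'])
    .reset_index(drop=True)
)
print(result)
```

Because no reindexing or filling is involved, X and Z keep their integer dtype, and a group whose minimum date equals the custom date (group 1 here) simply contributes the same row twice.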
