
Pandas/Pythonic way to groupby a column X, within each group, return value in column Y based on value in column Z

Reproducible example:

import pandas as pd

df = pd.DataFrame([[1, '2015-12-15', 10],
                   [1, '2015-12-16', 13], 
                   [1, '2015-12-17', 16], 
                   [2, '2015-12-15', 19],
                   [2, '2015-12-11', 22], 
                   [2, '2015-12-18', 25],
                   [3, '2015-12-14', 28], 
                   [3, '2015-12-12', 31], 
                   [3, '2015-12-15', 34]])

df.columns = ['X', 'Y', 'Z']
print(df.dtypes)
print()
print(df)

The output of the reproducible example, with the datatype of each column:

X     int64
Y    object
Z     int64
dtype: object

   X           Y   Z
0  1  2015-12-15  10
1  1  2015-12-16  13
2  1  2015-12-17  16
3  2  2015-12-15  19
4  2  2015-12-11  22
5  2  2015-12-18  25
6  3  2015-12-14  28
7  3  2015-12-12  31
8  3  2015-12-15  34

Expected Output:

   X           Y   Z
0  1  2015-12-15  10
1  1  2015-12-15  10
2  2  2015-12-11  22
3  2  2015-12-15  19
4  3  2015-12-12  31
5  3  2015-12-15  34

Explanation of what that output is:

For every group in column X (after grouping by X), I want one row with the value in column Z where the value in column Y is the minimum of all dates in column Y for that group, and a second row with the value in column Z where the value in column Y is a hard-coded custom date that is guaranteed to exist in every group. So every group ends up with two rows.

In my expected output, for group 1 the value in column Z is 10, because the minimum of all dates in column Y for group 1 is 2015-12-15, and its associated Z value is 10. For the same group 1, the second row uses the custom date 2015-12-15, whose Z value is also 10. For group 2, the minimum of all dates in column Y is 2015-12-11, and the corresponding Z value is 22; for the custom date 2015-12-15, it is 19.

Here is the linear-search code I wrote to accomplish this, which I assume is far from optimal:

uniqueXs = df['X'].unique() #Get every unique item in column X
df_list = [] #Empty list that will have rows of my final DataFrame

for x in uniqueXs: #Iterate through each unique value in column X

    idfiltered_dataframe = df.loc[df['X'] == x] #Filter DataFrame based on the current value in column X 
                                                #(iterating through list of all values)

    min_date = min(idfiltered_dataframe['Y']) #Min of column Y
    custom_date = '2015-12-15' #Every group WILL have this custom date.

    mindatefiltered_dataframe = idfiltered_dataframe.loc[idfiltered_dataframe['Y'] == min_date] #Within group, filter rows where column Y has minimum date
    customdatefiltered_dataframe = idfiltered_dataframe.loc[idfiltered_dataframe['Y'] == custom_date]  #Within group, filter rows where column Y has a custom date

    for row_1 in mindatefiltered_dataframe.index: #Iterate through mindatefiltered DataFrame and create list of each row value required

        row_list = [mindatefiltered_dataframe.at[row_1, 'X'], mindatefiltered_dataframe.at[row_1, 'Y'], mindatefiltered_dataframe.at[row_1, 'Z']]
        df_list.append(row_list) #Append to a master list

    for row_2 in customdatefiltered_dataframe.index: #Iterate through customdatefiltered DataFrame and create list of each row value required

        row_list = [customdatefiltered_dataframe.at[row_2, 'X'], customdatefiltered_dataframe.at[row_2, 'Y'], customdatefiltered_dataframe.at[row_2, 'Z']]
        df_list.append(row_list) #Append to a master list



print(pd.DataFrame(df_list)) #Create DataFrame out of the master list

I'm under the impression that there is some slick way where you just do df.groupby(...) and get the expected output, and I'm hoping someone could provide the code to do that.

IIUC

g1 = df.groupby('X').Y.value_counts().count(level=1).eq(df.X.nunique()) # dates that appear in every group
df.Y = pd.to_datetime(df.Y) # convert to datetime so dates sort correctly
g2 = df.sort_values('Y').groupby('X').head(1) # the min-date row of each group

pd.concat([df.loc[df.Y.isin(g1[g1].index)], g2]).sort_index() # combine both selections
Out[280]: 
   X          Y   Z
0  1 2015-12-15  10
0  1 2015-12-15  10
3  2 2015-12-15  19
4  2 2015-12-11  22
7  3 2015-12-12  31
8  3 2015-12-15  34
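Note that `Series.count(level=...)` used above is deprecated and was removed in pandas 2.0. A more direct sketch that avoids it: select the per-group minimum with `transform('min')` and the hard-coded date with a plain boolean mask (this assumes, as in the question, that Y holds ISO-formatted date strings, which sort correctly even without converting to datetime):

```python
import pandas as pd

df = pd.DataFrame([[1, '2015-12-15', 10], [1, '2015-12-16', 13],
                   [1, '2015-12-17', 16], [2, '2015-12-15', 19],
                   [2, '2015-12-11', 22], [2, '2015-12-18', 25],
                   [3, '2015-12-14', 28], [3, '2015-12-12', 31],
                   [3, '2015-12-15', 34]], columns=['X', 'Y', 'Z'])

custom_date = '2015-12-15'
# Rows where Y is the per-group minimum (ISO date strings sort lexicographically)
is_min = df['Y'] == df.groupby('X')['Y'].transform('min')
# Rows matching the hard-coded custom date
is_custom = df['Y'] == custom_date
out = (pd.concat([df[is_min], df[is_custom]])
         .sort_values(['X', 'Y'])
         .reset_index(drop=True))
print(out)
```

Like the answer above, this keeps the group-1 row twice, since its minimum date coincides with the custom date.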

Use -

import datetime as dt

date_fill = dt.datetime.strptime('2015-12-15', '%Y-%m-%d')
df['Y'] = pd.to_datetime(df['Y'], format='%Y-%m-%d')

df_g = df.loc[df.groupby(['X'])['Y'].idxmin()]          # one min-date row per group
df2 = df[df['Y']==date_fill]                            # rows matching the custom date
target_map = pd.Series(df2['Z'].tolist(), index=df2['X']).to_dict()  # X -> Z for the custom date
df_g.index = range(1, 2*len(df_g)+1, 2)                 # put min-date rows at odd positions
df_g = df_g.reindex(index=range(2*len(df_g)))           # insert empty rows at even positions
df_g['Y'] = df_g['Y'].fillna(date_fill)                 # the new rows get the custom date
df_g = df_g.bfill()                                     # copy X (and Z, temporarily) from the row below
df_g.loc[df_g['Y']==date_fill, 'Z'] = df_g[df_g['Y']==date_fill]['X'].map(target_map)
df_g = df_g.bfill()
print(df_g)

Output

     X          Y     Z
0  1.0 2015-12-15  10.0
1  1.0 2015-12-15  10.0
2  2.0 2015-12-15  19.0
3  2.0 2015-12-11  22.0
4  3.0 2015-12-15  34.0
5  3.0 2015-12-12  31.0

Explanation

  1. Put the desired custom date in date_fill
  2. df.groupby(['X'])['Y'].idxmin() takes the rows by min of Y
  3. target_map is a dict created to preserve Z values later
  4. Next, df_g is expanded to have NaN values in every alternate row
  5. df_g = df_g.bfill() appears twice in case you enter a date in date_fill that isn't present in the df . In that case target_map won't populate and you would otherwise end up with NaN values.
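To illustrate step 2, here is a minimal example of what `groupby(...)['Y'].idxmin()` returns (a toy frame, not the question's data):

```python
import pandas as pd

toy = pd.DataFrame({'X': [1, 1, 2, 2],
                    'Y': pd.to_datetime(['2015-12-16', '2015-12-15',
                                         '2015-12-11', '2015-12-18']),
                    'Z': [13, 10, 22, 25]})
# idxmin returns, per group, the index label of the row with the smallest Y
idx = toy.groupby('X')['Y'].idxmin()
print(toy.loc[idx])  # rows (1, 2015-12-15, 10) and (2, 2015-12-11, 22)
```

Feeding those labels back into `df.loc[...]` is what pulls out one whole min-date row per group.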

I am sure this can be optimized somewhat, but the thought process should help you proceed.
