Reproducible example:
import pandas as pd

df = pd.DataFrame([[1, '2015-12-15', 10],
                   [1, '2015-12-16', 13],
                   [1, '2015-12-17', 16],
                   [2, '2015-12-15', 19],
                   [2, '2015-12-11', 22],
                   [2, '2015-12-18', 25],
                   [3, '2015-12-14', 28],
                   [3, '2015-12-12', 31],
                   [3, '2015-12-15', 34]])
df.columns = ['X', 'Y', 'Z']
print(df.dtypes)
print()
print(df)
The output of the reproducible example, with the datatype of each column:
X int64
Y object
Z int64
dtype: object
X Y Z
0 1 2015-12-15 10
1 1 2015-12-16 13
2 1 2015-12-17 16
3 2 2015-12-15 19
4 2 2015-12-11 22
5 2 2015-12-18 25
6 3 2015-12-14 28
7 3 2015-12-12 31
8 3 2015-12-15 34
Expected Output:
X Y Z
0 1 2015-12-15 10
1 1 2015-12-15 10
2 2 2015-12-11 22
3 2 2015-12-15 19
4 3 2015-12-12 31
5 3 2015-12-15 34
Explanation of what that output is:
After grouping by column X, I want, for every group, one row with the value in column Z where the value in column Y is the minimum of all dates in column Y for that group, and a second row with the value in column Z where column Y equals a hard-coded custom date that is guaranteed to exist in every group. So every group would have two rows.
In my output, for group 1 the value in column Z is 10, because the minimum of all dates in column Y for group 1 is 2015-12-15, and the Z value associated with it is 10. The second row for group 1 uses the custom date 2015-12-15, whose Z value is also 10. For group 2, the minimum of all dates in column Y is 2015-12-11, and the corresponding Z value is 22; for the custom date 2015-12-15, it is 19.
Here is what I'm assuming to be some linear-time, naive code that I wrote to accomplish this:
from collections import Counter

uniqueXs = list(dict(Counter(df['X'].tolist())).keys())  # every unique item in column X, as a list
df_list = []  # will hold the rows of my final DataFrame
for x in uniqueXs:  # iterate through each unique value in column X
    idfiltered_dataframe = df.loc[df['X'] == x]  # filter the DataFrame on the current value of column X
    min_date = min(idfiltered_dataframe['Y'])  # min of column Y within the group
    custom_date = '2015-12-15'  # every group WILL have this custom date
    mindatefiltered_dataframe = idfiltered_dataframe.loc[idfiltered_dataframe['Y'] == min_date]  # within the group, rows where column Y has the minimum date
    customdatefiltered_dataframe = idfiltered_dataframe.loc[idfiltered_dataframe['Y'] == custom_date]  # within the group, rows where column Y has the custom date
    for row_1 in mindatefiltered_dataframe.index:  # collect the required values from each min-date row
        row_list = [mindatefiltered_dataframe.at[row_1, 'X'], mindatefiltered_dataframe.at[row_1, 'Y'], mindatefiltered_dataframe.at[row_1, 'Z']]
        df_list.append(row_list)  # append to the master list
    for row_2 in customdatefiltered_dataframe.index:  # collect the required values from each custom-date row
        row_list = [customdatefiltered_dataframe.at[row_2, 'X'], customdatefiltered_dataframe.at[row_2, 'Y'], customdatefiltered_dataframe.at[row_2, 'Z']]
        df_list.append(row_list)  # append to the master list
print(pd.DataFrame(df_list))  # build a DataFrame out of the master list
I'm under the impression that there is some slick way where you just do df.groupby... and get the expected output, and I'm hoping someone could provide me with the code to do that.
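For reference, the two-rows-per-group requirement can indeed be expressed directly with a groupby. The following is a minimal sketch (not from the original post), with the custom date hard-coded as in the question:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, '2015-12-15', 10], [1, '2015-12-16', 13], [1, '2015-12-17', 16],
     [2, '2015-12-15', 19], [2, '2015-12-11', 22], [2, '2015-12-18', 25],
     [3, '2015-12-14', 28], [3, '2015-12-12', 31], [3, '2015-12-15', 34]],
    columns=['X', 'Y', 'Z'])
df['Y'] = pd.to_datetime(df['Y'])  # datetimes compare chronologically

custom_date = '2015-12-15'  # the hard-coded date from the question
min_rows = df.loc[df.groupby('X')['Y'].idxmin()]  # one row per group at min(Y)
custom_rows = df[df['Y'] == custom_date]          # one row per group at the custom date
out = (pd.concat([min_rows, custom_rows])
         .sort_values(['X', 'Y'])
         .reset_index(drop=True))
print(out)
```

For group 1 the min-date row and the custom-date row coincide, so it appears twice, matching the expected output.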
IIUC:
g1 = df.groupby('X').Y.value_counts().count(level=1).eq(df.X.nunique())  # dates that appear in every group: value_counts per group, then count how many groups each date shows up in
df.Y = pd.to_datetime(df.Y)  # convert to datetime so sorting is chronological
g2 = df.sort_values('Y').groupby('X').head(1)  # the min-date row of each group
pd.concat([df.loc[df.Y.isin(g1[g1].index)], g2]).sort_index()  # combine the custom-date rows and the min-date rows
Out[280]:
X Y Z
0 1 2015-12-15 10
0 1 2015-12-15 10
3 2 2015-12-15 19
4 2 2015-12-11 22
7 3 2015-12-12 31
8 3 2015-12-15 34
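The g1 step — finding the dates that occur in every group — relies on the level= argument of count, which newer pandas versions have removed. A sketch of the same computation with a plain groupby, assuming the question's sample frame:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, '2015-12-15', 10], [1, '2015-12-16', 13], [1, '2015-12-17', 16],
     [2, '2015-12-15', 19], [2, '2015-12-11', 22], [2, '2015-12-18', 25],
     [3, '2015-12-14', 28], [3, '2015-12-12', 31], [3, '2015-12-15', 34]],
    columns=['X', 'Y', 'Z'])

# count, for each date, how many distinct groups it appears in
per_date = df.groupby('X')['Y'].value_counts().groupby(level=1).count()
common = per_date[per_date == df['X'].nunique()].index
print(list(common))  # the dates present in all groups
```

Here 2015-12-15 is the only date present in all three groups, so it is the only entry in common.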
Use:
import datetime as dt

date_fill = dt.datetime.strptime('2015-12-15', '%Y-%m-%d')
df['Y'] = pd.to_datetime(df['Y'], format='%Y-%m-%d')
df_g = df.loc[df.groupby(['X'])['Y'].idxmin()]  # one row per group at the minimum date
df2 = df[df['Y'] == date_fill]  # rows at the hard-coded custom date
target_map = pd.Series(df2['Z'].tolist(), index=df2['X']).to_dict()  # group -> Z at the custom date
df_g.index = range(1, 2 * len(df_g) + 1, 2)  # move min-date rows to odd positions
df_g = df_g.reindex(index=range(2 * len(df_g)))  # even positions become NaN placeholders
df_g['Y'] = df_g['Y'].fillna(date_fill)  # placeholders get the custom date
df_g = df_g.bfill()  # fill X (and Z, temporarily) from the row below
df_g.loc[df_g['Y'] == date_fill, 'Z'] = df_g[df_g['Y'] == date_fill]['X'].map(target_map)  # correct Z for the custom-date rows
df_g = df_g.bfill()
print(df_g)
Output
X Y Z
0 1.0 2015-12-15 10.0
1 1.0 2015-12-15 10.0
2 2.0 2015-12-15 19.0
3 2.0 2015-12-11 22.0
4 3.0 2015-12-15 34.0
5 3.0 2015-12-12 31.0
Explanation
date_fill is the hard-coded custom date. df.groupby(['X'])['Y'].idxmin() takes, for each group, the row with the minimum of Y. target_map is a dict created to preserve the Z values for later. df_g is expanded to have NaN values in every alternate row. df_g = df_g.bfill() appears twice in case you enter a date in date_fill that isn't present in the df; in that case target_map won't populate and you would otherwise end up with NaN values. I am sure this can be optimized somewhat, but the thought process should help you proceed.
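The target_map idea in isolation — mapping each group to its Z value at the custom date — can also be sketched with a plain zip (an equivalent rewrite, not the answer's exact code):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, '2015-12-15', 10], [1, '2015-12-16', 13], [1, '2015-12-17', 16],
     [2, '2015-12-15', 19], [2, '2015-12-11', 22], [2, '2015-12-18', 25],
     [3, '2015-12-14', 28], [3, '2015-12-12', 31], [3, '2015-12-15', 34]],
    columns=['X', 'Y', 'Z'])

date_fill = '2015-12-15'  # the hard-coded custom date
df2 = df[df['Y'] == date_fill]  # one row per group at the custom date
target_map = dict(zip(df2['X'], df2['Z']))  # group -> Z value at the custom date
print(target_map)
```

With the sample data this yields one entry per group: 10 for group 1, 19 for group 2, and 34 for group 3.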