
Pandas Dataframe Python | How to compare a cell with another cell of a copied dataframe?

I have two identical DataFrames with different names (df_1 and df_2).

Let's say the DataFrames have two columns, Category and Time. For example:

Category Time
A 2020-02-02 05:05:05.0000
A 2020-02-02 06:06:06.0000
A 2020-02-02 07:07:07.0000
B 2020-02-02 05:05:05.0000
B 2020-02-02 06:06:06.0000
C 2020-02-02 05:05:05.0000
C 2020-02-02 06:06:06.0000

I want the following: for every category in df_1 that also appears in df_2, a new DataFrame (with columns Category, Start Time, End Time) should hold the first datetime of that category as the start time and the last datetime as the end time. For category A, for example, the start time would be 2020-02-02 05:05:05.0000 and the end time would be 2020-02-02 07:07:07.0000.

Expected result (new DataFrame):

Category  Start Time                End Time
A         2020-02-02 05:05:05.0000  2020-02-02 07:07:07.0000
B         2020-02-02 05:05:05.0000  2020-02-02 06:06:06.0000
C         2020-02-02 05:05:05.0000  2020-02-02 06:06:06.0000

How can I achieve this? Please help.

Solution for the original question

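# The outer parentheses let the method chain span several lines.
# String names are used because recent pandas versions warn about the builtins min/max.
(pd.concat([df_1.groupby("CATEGORY").agg(["min", "max"]),
            df_2.groupby("CATEGORY").agg(["min", "max"])],
           join="inner", axis=1)
   .apply(["min", "max"], axis=1)
   .rename(columns={"min": "START TIME", "max": "END TIME"}))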

Explanation

  1. First, you group each DataFrame by CATEGORY and keep the min and max value of each group. This also sets the index to CATEGORY.

     grouped_1 = df_1.groupby("CATEGORY").agg(["min", "max"])
     grouped_2 = df_2.groupby("CATEGORY").agg(["min", "max"])
  2. Then, you do an inner join to keep only the CATEGORY values that are in both df_1 and df_2. The join is done on the index, which is what we want here (the CATEGORY column of the original DataFrames). You concatenate horizontally, getting 4 columns: two min and two max values per row.

     grouped_both = pd.concat([grouped_1, grouped_2], join="inner", axis=1)
  3. You keep the min and max values of each row, and rename the columns.

     final_df = grouped_both.apply(["min", "max"], axis=1).rename(columns={"min": "START TIME", "max": "END TIME"})
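To make the three steps concrete, here is a minimal sketch run on the sample data from the question. It assumes the upper-case column names CATEGORY and TIME used in the answer's code, and that TIME is already a datetime column. The second frame df_2 is made up for illustration (no category C, one extra later row for A) so that the inner join has a visible effect. Step 3 is written as an explicit row-wise min/max over suffixed columns instead of apply, which should give the same result but is not the answer's exact code.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "CATEGORY": ["A", "A", "A", "B", "B", "C", "C"],
    "TIME": pd.to_datetime([
        "2020-02-02 05:05:05", "2020-02-02 06:06:06", "2020-02-02 07:07:07",
        "2020-02-02 05:05:05", "2020-02-02 06:06:06",
        "2020-02-02 05:05:05", "2020-02-02 06:06:06",
    ]),
})

# Hypothetical second frame: category C removed, one later timestamp added for A.
extra_row = pd.DataFrame({
    "CATEGORY": ["A"],
    "TIME": pd.to_datetime(["2020-02-02 08:08:08"]),
})
df_2 = pd.concat([df_1[df_1["CATEGORY"] != "C"], extra_row], ignore_index=True)

# Step 1: min and max TIME per CATEGORY; the suffixes keep the column names unique.
grouped_1 = df_1.groupby("CATEGORY")["TIME"].agg(["min", "max"]).add_suffix("_1")
grouped_2 = df_2.groupby("CATEGORY")["TIME"].agg(["min", "max"]).add_suffix("_2")

# Step 2: inner join on the CATEGORY index keeps only categories present in both.
grouped_both = pd.concat([grouped_1, grouped_2], join="inner", axis=1)

# Step 3: earliest start and latest end across both frames, row by row.
final_df = pd.DataFrame({
    "START TIME": grouped_both[["min_1", "min_2"]].min(axis=1),
    "END TIME": grouped_both[["max_1", "max_2"]].max(axis=1),
})
print(final_df)
```

Category C drops out because it is missing from the made-up df_2, and the end time for A comes from the extra row that only df_2 contains.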

NOTE: I assumed you wanted to merge the first and last timestamps of both DataFrames. If you truly wanted the start from df_1 and end from df_2, it would be a slightly different solution.
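For the other reading mentioned in the note (start taken only from df_1, end taken only from df_2), one possible sketch is below. The helper name start_from_first_end_from_second and the small input frames are hypothetical, and the column names CATEGORY and TIME are assumed to match the code above.

```python
import pandas as pd

def start_from_first_end_from_second(df_1, df_2):
    """Per CATEGORY: earliest TIME in df_1 and latest TIME in df_2."""
    start = df_1.groupby("CATEGORY")["TIME"].min().rename("START TIME")
    end = df_2.groupby("CATEGORY")["TIME"].max().rename("END TIME")
    # The inner join on the CATEGORY index drops categories missing from either frame.
    return pd.concat([start, end], join="inner", axis=1)

# Made-up inputs for illustration only.
first = pd.DataFrame({
    "CATEGORY": ["A", "A", "B"],
    "TIME": pd.to_datetime([
        "2020-02-02 05:05:05", "2020-02-02 06:06:06", "2020-02-02 05:05:05",
    ]),
})
second = pd.DataFrame({
    "CATEGORY": ["A", "A", "C"],
    "TIME": pd.to_datetime([
        "2020-02-02 05:30:00", "2020-02-02 07:07:07", "2020-02-02 06:06:06",
    ]),
})

result = start_from_first_end_from_second(first, second)
print(result)
```

Only category A appears in both made-up frames, so the result has a single row whose start comes from the first frame and whose end comes from the second.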

Solution for one DataFrame and adding duration

If I understood correctly, then you don't need to copy the original DataFrame.

# Group the DataFrame by CATEGORY and keep the min and max TIME of each group.
# Selecting "TIME" before aggregating avoids a MultiIndex on the columns.
joined_df = df_1.groupby("CATEGORY")["TIME"].agg(["min", "max"])
# Keep only rows where the min is different from the max
# (.copy() avoids a SettingWithCopyWarning on the assignment below)
joined_df = joined_df[joined_df["min"] != joined_df["max"]].copy()
# Calculate the time delta between min and max as a number of minutes.
# This assumes TIME is a datetime column; astype('timedelta64[m]') is rejected by pandas 2.x.
joined_df["DURATION"] = (joined_df["max"] - joined_df["min"]).dt.total_seconds() / 60
# Rename the min and max columns
joined_df = joined_df.rename(columns={"min": "START TIME", "max": "END TIME"})
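As a worked example, the single-DataFrame approach can be wrapped in a small function and run on the question's sample data. This is only a sketch: the function name summarize_categories and the extra single-row category D are hypothetical, TIME is assumed to be a datetime column, and the duration in minutes is computed with .dt.total_seconds() / 60. Named aggregation is used here so the columns get their final names directly.

```python
import pandas as pd

def summarize_categories(df):
    """Start, end and duration in minutes per CATEGORY; zero-length groups are dropped."""
    out = df.groupby("CATEGORY")["TIME"].agg(
        **{"START TIME": "min", "END TIME": "max"}
    )
    # Drop categories whose first and last timestamps coincide.
    out = out[out["START TIME"] != out["END TIME"]].copy()
    out["DURATION"] = (out["END TIME"] - out["START TIME"]).dt.total_seconds() / 60
    return out

df_1 = pd.DataFrame({
    "CATEGORY": ["A", "A", "A", "B", "B", "C", "C", "D"],
    "TIME": pd.to_datetime([
        "2020-02-02 05:05:05", "2020-02-02 06:06:06", "2020-02-02 07:07:07",
        "2020-02-02 05:05:05", "2020-02-02 06:06:06",
        "2020-02-02 05:05:05", "2020-02-02 06:06:06",
        "2020-02-02 05:05:05",  # hypothetical category with a single row
    ]),
})

result = summarize_categories(df_1)
print(result)
```

Category A spans 2 h 2 min 2 s, so its DURATION is 7322 / 60, roughly 122.03 minutes. The single-row category D is filtered out because its start and end are equal.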

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, contact yoyou2525@163.com.