Pandas Dataframe Python | How to compare a cell with another cell of a copied dataframe?

Question

I have two identical dataframes with different names (df_1 and df_2).

Let's say the dataframes have two columns, Category and Time. For example:

Category	Time
A	2020-02-02 05:05:05.0000
A	2020-02-02 06:06:06.0000
A	2020-02-02 07:07:07.0000
B	2020-02-02 05:05:05.0000
B	2020-02-02 06:06:06.0000
C	2020-02-02 05:05:05.0000
C	2020-02-02 06:06:06.0000

I want the following: if a category in df_1 matches a category in df_2, then in a new dataframe (with columns category, starttime, endtime) I want the first datetime of that category in starttime and the last datetime in endtime. For example, for category A, starttime would be 2020-02-02 05:05:05.0000 and endtime would be 2020-02-02 07:07:07.0000.

Final Result new dataframe:

Category	Start Time	End Time
A	2020-02-02 05:05:05.0000	2020-02-02 07:07:07.0000
B	2020-02-02 05:05:05.0000	2020-02-02 06:06:06.0000
C	2020-02-02 05:05:05.0000	2020-02-02 06:06:06.0000

How can I achieve this? Please help.

Answer 1

Solution for the original question

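import pandas as pd

# The outer parentheses let the method chain span several lines
(pd.concat([df_1.groupby("CATEGORY").agg(["min", "max"]),
            df_2.groupby("CATEGORY").agg(["min", "max"])],
           join="inner", axis=1)
   .apply(["min", "max"], axis=1)
   .rename(columns={"min": "START TIME", "max": "END TIME"}))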

Explanation

First, you group each DataFrame by CATEGORY and keep the min and max TIME of each category. This also sets the index to CATEGORY.
```
grouped_1 = df_1.groupby("CATEGORY").agg(["min", "max"])
grouped_2 = df_2.groupby("CATEGORY").agg(["min", "max"])
```
Then, you do an inner join to keep only the categories that are in both df_1 and df_2. With pd.concat and axis=1, the join is done on the index, which is what we want here (the CATEGORY column of the original DataFrames). Concatenating horizontally gives 4 columns: two min and two max values per row.
```
grouped_both = pd.concat([grouped_1, grouped_2], join="inner", axis=1)
```
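
To see what the inner join does, here is a small sketch with made-up data. The sample values are assumptions for illustration only (they are not from the question), and the per-category tables are built by hand rather than with groupby. Category B exists only in the first table and D only in the second, so neither survives the join.

```python
import pandas as pd

# Hypothetical per-category min/max tables, indexed by CATEGORY
grouped_1 = pd.DataFrame(
    {"min": pd.to_datetime(["2020-02-02 05:05:05", "2020-02-02 05:05:05"]),
     "max": pd.to_datetime(["2020-02-02 07:07:07", "2020-02-02 06:06:06"])},
    index=pd.Index(["A", "B"], name="CATEGORY"),
)
grouped_2 = pd.DataFrame(
    {"min": pd.to_datetime(["2020-02-02 04:04:04", "2020-02-02 05:05:05"]),
     "max": pd.to_datetime(["2020-02-02 06:06:06", "2020-02-02 08:08:08"])},
    index=pd.Index(["A", "D"], name="CATEGORY"),
)

# axis=1 places the tables side by side; join="inner" keeps only the
# index labels present in both tables, so B and D are dropped
grouped_both = pd.concat([grouped_1, grouped_2], join="inner", axis=1)
print(grouped_both.index.tolist())  # ['A']
print(grouped_both.shape)           # (1, 4)
```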

You keep the min and max values of each row, and rename the columns.

final_df = grouped_both.apply(["min", "max"], axis=1).rename(columns={"min": "START TIME", "max": "END TIME"})
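
For reference, the three steps can be run end to end on the sample data from the question. This is only a sketch: the uppercase column names follow the answer rather than the question, df_2 is assumed to be a plain copy of df_1, and the row-wise step uses DataFrame.min(axis=1) and DataFrame.max(axis=1) in place of apply, which should give the same result here.

```python
import pandas as pd

# Sample data from the question; column names follow the answer's spelling
df_1 = pd.DataFrame({
    "CATEGORY": ["A", "A", "A", "B", "B", "C", "C"],
    "TIME": pd.to_datetime([
        "2020-02-02 05:05:05", "2020-02-02 06:06:06", "2020-02-02 07:07:07",
        "2020-02-02 05:05:05", "2020-02-02 06:06:06",
        "2020-02-02 05:05:05", "2020-02-02 06:06:06",
    ]),
})
df_2 = df_1.copy()  # assumption: the second frame is an exact copy

# Step 1: per-category min and max, indexed by CATEGORY
grouped_1 = df_1.groupby("CATEGORY").agg(["min", "max"])
grouped_2 = df_2.groupby("CATEGORY").agg(["min", "max"])

# Step 2: keep only categories present in both frames
grouped_both = pd.concat([grouped_1, grouped_2], join="inner", axis=1)

# Step 3: row-wise extremes across the four columns
final_df = pd.DataFrame({
    "START TIME": grouped_both.min(axis=1),
    "END TIME": grouped_both.max(axis=1),
})
print(final_df)
```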

NOTE: I assumed you wanted to merge the first and last timestamps of both DataFrames. If you truly wanted the start from df_1 and end from df_2, it would be a slightly different solution.
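
If the start really should come from df_1 and the end from df_2, one possible variant is sketched below. It assumes that "first" means the earliest timestamp and "last" the latest, and the sample frames are made up so that the two DataFrames differ.

```python
import pandas as pd

# Hypothetical frames that share only category A
df_1 = pd.DataFrame({
    "CATEGORY": ["A", "A", "B"],
    "TIME": pd.to_datetime(["2020-02-02 05:05:05", "2020-02-02 06:06:06",
                            "2020-02-02 05:05:05"]),
})
df_2 = pd.DataFrame({
    "CATEGORY": ["A", "A", "C"],
    "TIME": pd.to_datetime(["2020-02-02 06:06:06", "2020-02-02 07:07:07",
                            "2020-02-02 05:05:05"]),
})

# Earliest time per category from df_1, latest time per category from df_2
start = df_1.groupby("CATEGORY")["TIME"].min().rename("START TIME")
end = df_2.groupby("CATEGORY")["TIME"].max().rename("END TIME")

# The inner join on the CATEGORY index keeps only categories found in both
result = pd.concat([start, end], join="inner", axis=1)
print(result)
```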

Solution for one DataFrame and adding duration

If I understood correctly, then you don't need to copy the original DataFrame.

# Group the DataFrame by CATEGORY and keep the min and max values
# We also need to get rid of the newly created MultiIndex level "TIME"
joined_df = df_1.groupby("CATEGORY").agg(["min", "max"])["TIME"]
# Keep only rows where the min is different from the max
# (.copy() avoids a SettingWithCopyWarning when adding a column below)
joined_df = joined_df[joined_df["min"] != joined_df["max"]].copy()
# Calculate the time deltas between min and max, then convert them to
# a float number of minutes (astype('timedelta64[m]') raises in pandas 2.x)
joined_df["DURATION"] = (joined_df["max"] - joined_df["min"]).dt.total_seconds() / 60
# We rename the columns min and max
joined_df = joined_df.rename(columns={"min": "START TIME", "max": "END TIME"})
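
As an alternative sketch of the same idea, named aggregation can produce the final column names directly, which avoids the MultiIndex and the rename step. The data below is the sample from the question; the variable name summary is hypothetical, and the filter on rows where min equals max is left out for brevity.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "CATEGORY": ["A", "A", "A", "B", "B", "C", "C"],
    "TIME": pd.to_datetime([
        "2020-02-02 05:05:05", "2020-02-02 06:06:06", "2020-02-02 07:07:07",
        "2020-02-02 05:05:05", "2020-02-02 06:06:06",
        "2020-02-02 05:05:05", "2020-02-02 06:06:06",
    ]),
})

# Named aggregation: output column name -> (input column, function).
# The dict is unpacked because the output names contain spaces.
summary = df_1.groupby("CATEGORY").agg(
    **{"START TIME": ("TIME", "min"), "END TIME": ("TIME", "max")}
)

# Duration in minutes, as a float
summary["DURATION"] = (
    summary["END TIME"] - summary["START TIME"]
).dt.total_seconds() / 60
print(summary)
```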
