Pandas Dataframe Python | How to compare a cell with another cell of a copied dataframe?
I have two identical dataframes with different names (df_1 and df_2).
Let's say the dataframes have two columns, Category and Time. For example:
Category | Time |
---|---|
A | 2020-02-02 05:05:05.0000 |
A | 2020-02-02 06:06:06.0000 |
A | 2020-02-02 07:07:07.0000 |
B | 2020-02-02 05:05:05.0000 |
B | 2020-02-02 06:06:06.0000 |
C | 2020-02-02 05:05:05.0000 |
C | 2020-02-02 06:06:06.0000 |
I want the following: if a category in df_1 matches a category in df_2, then that category should get one row in a new dataframe (with columns category, starttime, endtime). For category A, for example, I want the first datetime (2020-02-02 05:05:05.0000) in the starttime column and the last datetime (2020-02-02 07:07:07.0000) in the endtime column.
Final result (new dataframe):
Category | Start Time | End Time |
---|---|---|
A | 2020-02-02 05:05:05.0000 | 2020-02-02 07:07:07.0000 |
B | 2020-02-02 05:05:05.0000 | 2020-02-02 06:06:06.0000 |
C | 2020-02-02 05:05:05.0000 | 2020-02-02 06:06:06.0000 |
How can I achieve this? Please help.
import pandas as pd

final_df = (
    pd.concat([df_1.groupby("CATEGORY").agg(["min", "max"]),
               df_2.groupby("CATEGORY").agg(["min", "max"])],
              join="inner", axis=1)
    .apply(["min", "max"], axis=1)
    .rename(columns={"min": "START TIME", "max": "END TIME"})
)
First, you group each DataFrame by CATEGORY and keep the min and max of its values. This also sets the index to CATEGORY.
grouped_1 = df_1.groupby("CATEGORY").agg(["min", "max"])
grouped_2 = df_2.groupby("CATEGORY").agg(["min", "max"])
Then, you do an inner join to keep only the categories that are in both df_1 and df_2. By default, the inner join is done on the index, which is what we want here (the CATEGORY column of our original DataFrames). You concatenate horizontally, getting 4 columns: two min and two max values per row.
grouped_both = pd.concat([grouped_1, grouped_2], join="inner", axis=1)
Finally, you keep the overall min and max of each row and rename the columns.
final_df = grouped_both.apply(["min", "max"], axis=1).rename(columns={"min": "START TIME", "max": "END TIME"})
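To try the three steps end to end, here is a minimal sketch on the sample data from the question. It assumes the columns are named CATEGORY and TIME (as in the code above) and that TIME is already a datetime column. It selects the TIME column before aggregating so the result has flat `min`/`max` columns, and it uses row-wise `min(axis=1)` / `max(axis=1)` in place of `apply`. That is a swapped-in variant that should give the same result, not the exact code of the answer.

```python
import pandas as pd

# Sample data from the question; df_2 is simply a copy of df_1
df_1 = pd.DataFrame({
    "CATEGORY": ["A", "A", "A", "B", "B", "C", "C"],
    "TIME": pd.to_datetime([
        "2020-02-02 05:05:05", "2020-02-02 06:06:06", "2020-02-02 07:07:07",
        "2020-02-02 05:05:05", "2020-02-02 06:06:06",
        "2020-02-02 05:05:05", "2020-02-02 06:06:06",
    ]),
})
df_2 = df_1.copy()

# Step 1: per-category min and max of TIME (index becomes CATEGORY)
grouped_1 = df_1.groupby("CATEGORY")["TIME"].agg(["min", "max"])
grouped_2 = df_2.groupby("CATEGORY")["TIME"].agg(["min", "max"])

# Step 2: inner join on the index keeps only categories present in both
grouped_both = pd.concat([grouped_1, grouped_2], join="inner", axis=1)

# Step 3: earliest and latest timestamp across the four columns of each row
final_df = pd.DataFrame({
    "START TIME": grouped_both.min(axis=1),
    "END TIME": grouped_both.max(axis=1),
})

print(final_df)
```

With identical inputs, this should reproduce the result table from the question: one row per category with its first and last timestamp.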
NOTE: I assumed you wanted to merge the first and last timestamps of both DataFrames. If you really wanted the start from df_1 and the end from df_2, the solution would be slightly different.
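The answer does not spell out that alternative, so the following is only one possible sketch of it: take the earliest TIME per category from df_1, the latest TIME per category from df_2, and inner-join the two on CATEGORY. The sample values are hypothetical and chosen so that the two DataFrames differ.

```python
import pandas as pd

# Hypothetical data: the two DataFrames only share category A
df_1 = pd.DataFrame({
    "CATEGORY": ["A", "A", "B"],
    "TIME": pd.to_datetime([
        "2020-02-02 05:05:05", "2020-02-02 06:06:06", "2020-02-02 05:05:05",
    ]),
})
df_2 = pd.DataFrame({
    "CATEGORY": ["A", "A", "C"],
    "TIME": pd.to_datetime([
        "2020-02-02 06:06:06", "2020-02-02 07:07:07", "2020-02-02 05:05:05",
    ]),
})

# Start comes from df_1 only, end comes from df_2 only
start = df_1.groupby("CATEGORY")["TIME"].min().rename("START TIME")
end = df_2.groupby("CATEGORY")["TIME"].max().rename("END TIME")

# Inner join on the CATEGORY index drops B and C, which appear in only one frame
result = pd.concat([start, end], join="inner", axis=1)

print(result)
```

Here only category A survives, with its start taken from df_1 and its end from df_2.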
If I understood correctly, you don't need to copy the original DataFrame at all.
# Group the DataFrame by CATEGORY and keep the min and max values
# We also need to get rid of the newly created MultiIndex level "TIME"
joined_df = df_1.groupby("CATEGORY").agg(["min", "max"])["TIME"]
# Keep only rows where the min is different from the max
# (copy so that adding a column later does not touch a slice)
joined_df = joined_df[joined_df["min"] != joined_df["max"]].copy()
# Calculate the time deltas between min and max,
# then convert them to a number of minutes
# (astype('timedelta64[m]') is rejected by pandas 2.x)
joined_df["DURATION"] = (joined_df["max"] - joined_df["min"]).dt.total_seconds() / 60
# We rename the columns min and max
joined_df = joined_df.rename(columns={"min":"START TIME", "max":"END TIME"})
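For reference, here is a minimal runnable sketch of this single-DataFrame approach. It assumes TIME is a datetime column and adds a hypothetical category D with a single row to show the filtering step. The duration is computed with `.dt.total_seconds() / 60`, which gives fractional minutes and does not rely on casting to `timedelta64[m]`.

```python
import pandas as pd

# Sample data from the question, plus a hypothetical single-row category D
df_1 = pd.DataFrame({
    "CATEGORY": ["A", "A", "A", "B", "B", "D"],
    "TIME": pd.to_datetime([
        "2020-02-02 05:05:05", "2020-02-02 06:06:06", "2020-02-02 07:07:07",
        "2020-02-02 05:05:05", "2020-02-02 06:06:06",
        "2020-02-02 05:05:05",
    ]),
})

# First and last timestamp per category, as flat "min" / "max" columns
spans = df_1.groupby("CATEGORY")["TIME"].agg(["min", "max"])

# Drop categories whose first and last timestamps coincide (here: D)
spans = spans[spans["min"] != spans["max"]].copy()

# Duration in minutes as a float
spans["DURATION"] = (spans["max"] - spans["min"]).dt.total_seconds() / 60

spans = spans.rename(columns={"min": "START TIME", "max": "END TIME"})

print(spans)
```

Category A spans 2 h 2 min 2 s and B spans 1 h 1 min 1 s, so their durations come out slightly above 122 and 61 minutes respectively.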