Create a new column by comparing existing columns in a DataFrame

Question

I have the following DataFrame:

   datetime            day_fetched         col_a col_b 
0  2023-01-02 12:00:00 2023-01-01 12:00:00 100  200  
1  2023-01-02 12:00:00 2023-01-02 12:00:00 120  400  
2  2023-01-03 12:00:00 2023-01-02 12:00:00 140  500  
3  2023-01-03 12:00:00 2023-01-03 12:00:00 160  700 
4  2023-01-04 12:00:00 2023-01-03 12:00:00 200  300 
5  2023-01-04 12:00:00 2023-01-04 12:00:00 430  200

I want to create a new column that takes the value 2 if the dates in datetime and day_fetched differ, and the value 1 if they are the same.

So my new DataFrame should look like this:

   datetime            day_fetched         col_a col_b day_ahead
0  2023-01-02 12:00:00 2023-01-01 12:00:00 100  200    2
1  2023-01-02 12:00:00 2023-01-02 12:00:00 120  400    1
2  2023-01-03 12:00:00 2023-01-02 12:00:00 140  500    2
3  2023-01-03 12:00:00 2023-01-03 12:00:00 160  700    1
4  2023-01-04 12:00:00 2023-01-03 12:00:00 200  300    2
5  2023-01-04 12:00:00 2023-01-04 12:00:00 430  200    1

Then, based on the day_ahead column, I want to split col_a and col_b into col_a_1 and col_a_2, and col_b_1 and col_b_2.

So the final DataFrame will look like this:

   datetime            day_fetched         col_a_1 col_a_2 col_b_1 col_b_2 day_ahead
0  2023-01-02 12:00:00 2023-01-01 12:00:00 NaN     100     NaN     200     2
1  2023-01-02 12:00:00 2023-01-02 12:00:00 120     NaN     400     NaN     1
2  2023-01-03 12:00:00 2023-01-02 12:00:00 NaN     140     NaN     500     2
3  2023-01-03 12:00:00 2023-01-03 12:00:00 160     NaN     700     NaN     1
4  2023-01-04 12:00:00 2023-01-03 12:00:00 NaN     200     NaN     300     2
5  2023-01-04 12:00:00 2023-01-04 12:00:00 430     NaN     200     NaN     1

Answer 1

One solution is to use np.where:

import pandas as pd
import numpy as np
df = pd.DataFrame(data=
[["2023-01-02 12:00:00", "2023-01-01 12:00:00", 100,  200],
["2023-01-02 12:00:00", "2023-01-02 12:00:00", 120,  400],
["2023-01-03 12:00:00", "2023-01-02 12:00:00", 140,  500],
["2023-01-03 12:00:00", "2023-01-03 12:00:00", 160,  700],
["2023-01-04 12:00:00", "2023-01-03 12:00:00", 200,  300],
["2023-01-04 12:00:00", "2023-01-04 12:00:00", 430,  200]],
columns=["datetime","day_fetched","col_a","col_b"])

# days ahead
df["day_ahead"] = np.where(df["datetime"] == df["day_fetched"], 1, 2)
# column of None's for next section
df["na"] = None
# overwrite dataframe with new df
df = pd.DataFrame(data=np.where(df["day_ahead"] == 1,
                                [df["datetime"], df["day_fetched"],
                                 df["col_a"], df["na"],
                                 df["col_b"], df["na"],
                                 df["day_ahead"]],
                                [df["datetime"], df["day_fetched"],
                                 df["na"], df["col_a"],
                                 df["na"], df["col_b"],
                                 df["day_ahead"]]).T,
                  columns=["datetime", "day_fetched",
                           "col_a_1", "col_a_2",
                           "col_b_1", "col_b_2",
                           "day_ahead"])

df
#               datetime          day_fetched col_a_1  ... col_b_1 col_b_2 day_ahead
# 0  2023-01-02 12:00:00  2023-01-01 12:00:00    None  ...    None     200         2
# 1  2023-01-02 12:00:00  2023-01-02 12:00:00     120  ...     400    None         1
# 2  2023-01-03 12:00:00  2023-01-02 12:00:00    None  ...    None     500         2
# 3  2023-01-03 12:00:00  2023-01-03 12:00:00     160  ...     700    None         1
# 4  2023-01-04 12:00:00  2023-01-03 12:00:00    None  ...    None     300         2
# 5  2023-01-04 12:00:00  2023-01-04 12:00:00     430  ...     200    None         1

# [6 rows x 7 columns]
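The comparison above tests the full timestamp strings, which works for the sample data because every time of day is 12:00:00. If only the calendar date should matter, as the question's wording suggests, one possible variation is to parse the columns and compare normalised dates. This is a minimal sketch; the 09:30:00 time in the second row is an invented value used only to show the difference.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the 09:30:00 time is an assumption, not from the question
df = pd.DataFrame({
    "datetime": ["2023-01-02 12:00:00", "2023-01-02 12:00:00"],
    "day_fetched": ["2023-01-01 12:00:00", "2023-01-02 09:30:00"],
})

# Parse the strings and drop the time of day (normalize sets it to midnight)
datetime_day = pd.to_datetime(df["datetime"]).dt.normalize()
fetched_day = pd.to_datetime(df["day_fetched"]).dt.normalize()

# 1 when both fall on the same calendar date, otherwise 2
df["day_ahead"] = np.where(datetime_day == fetched_day, 1, 2)
print(df["day_ahead"].tolist())  # [2, 1]
```

With a plain string comparison the second row would be labelled 2, because the two strings differ in their time component.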

When asking a question, please provide data that can be easily copied, such as by using df.to_dict().
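As a rough illustration of that advice, the printed dictionary can be pasted straight into pd.DataFrame(...) by anyone answering. The two-column frame here is a cut-down stand-in for the question's data.

```python
import pandas as pd

# Cut-down stand-in for the question's DataFrame
df = pd.DataFrame({"col_a": [100, 120], "col_b": [200, 400]})

# The text printed here is what would be pasted into the question
as_dict = df.to_dict()
print(as_dict)  # {'col_a': {0: 100, 1: 120}, 'col_b': {0: 200, 1: 400}}

# A reader rebuilds the frame from the pasted dictionary
rebuilt = pd.DataFrame(as_dict)
```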

EDIT - Generalised for many columns

Here is a (more complicated) bit of code that uses a list comprehension to pivot each col_ column on the value of day_ahead and concatenates the pieces to produce the same result. It should be run on the DataFrame that still holds col_a, col_b and day_ahead, i.e. in place of the np.where overwrite above:

df = pd.concat(
    [df.pivot_table(index=[df.index, "datetime", "day_fetched"],
                    columns=["day_ahead"],
                    values=x).add_prefix(x+"_") for x in \
     df.columns[df.columns.str.startswith("col_")]] + \
        [df.set_index([df.index, "datetime", "day_fetched"])["day_ahead"]],
        axis=1).reset_index(level=[1, 2])

The second, third and fourth lines above create the pivot table and add the column name plus "_" as a prefix; the list comprehension (fourth and fifth lines) repeats this for each column in df whose name starts with "col_". The sixth line appends the day_ahead column, and the seventh line concatenates everything along the columns and resets the index so that datetime and day_fetched become columns again.
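For comparison, the same generalisation can be sketched without pivot_table by masking each column with Series.where. This is an alternative to the answer's approach, not a restatement of it, and the three-row frame is a shortened copy of the question's data.

```python
import pandas as pd

# Shortened copy of the question's data, with day_ahead already computed
df = pd.DataFrame({
    "datetime": ["2023-01-02 12:00:00", "2023-01-02 12:00:00", "2023-01-03 12:00:00"],
    "col_a": [100, 120, 140],
    "col_b": [200, 400, 500],
    "day_ahead": [2, 1, 2],
})

# Every column whose name starts with "col_" gets split
value_cols = [c for c in df.columns if c.startswith("col_")]

# For each column and each day_ahead value, keep the value only on matching rows
parts = {
    f"{col}_{k}": df[col].where(df["day_ahead"] == k)
    for col in value_cols
    for k in sorted(df["day_ahead"].unique())
}

# Remaining columns first, then the split columns, then day_ahead last
result = pd.concat(
    [df.drop(columns=value_cols + ["day_ahead"]), pd.DataFrame(parts), df["day_ahead"]],
    axis=1,
)
print(result.columns.tolist())
```

Series.where leaves NaN wherever the condition is False, so the split columns come out as floats, much like the NaN-filled layout requested in the question.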

Solution 1
1 Accepted 2023-01-20 12:26:01