
Create new column by comparing existing column in a DataFrame

I have the following DataFrame:

   datetime            day_fetched         col_a col_b 
0  2023-01-02 12:00:00 2023-01-01 12:00:00 100  200  
1  2023-01-02 12:00:00 2023-01-02 12:00:00 120  400  
2  2023-01-03 12:00:00 2023-01-02 12:00:00 140  500  
3  2023-01-03 12:00:00 2023-01-03 12:00:00 160  700 
4  2023-01-04 12:00:00 2023-01-03 12:00:00 200  300 
5  2023-01-04 12:00:00 2023-01-04 12:00:00 430  200 

I want to create a new column that takes the value 2 if the date of datetime differs from the date of day_fetched, and the value 1 if the two dates are the same.

So my new DataFrame should look like this:

   datetime            day_fetched         col_a col_b day_ahead
0  2023-01-02 12:00:00 2023-01-01 12:00:00 100  200    2
1  2023-01-02 12:00:00 2023-01-02 12:00:00 120  400    1
2  2023-01-03 12:00:00 2023-01-02 12:00:00 140  500    2
3  2023-01-03 12:00:00 2023-01-03 12:00:00 160  700    1
4  2023-01-04 12:00:00 2023-01-03 12:00:00 200  300    2
5  2023-01-04 12:00:00 2023-01-04 12:00:00 430  200    1

Then, based on the day_ahead column, I want to split col_a and col_b into col_a_1 and col_a_2, and col_b_1 and col_b_2.

So the final DataFrame will look like this:

   datetime            day_fetched         col_a_1 col_a_2 col_b_1 col_b_2 day_ahead
0  2023-01-02 12:00:00 2023-01-01 12:00:00 NaN     100     NaN     200     2
1  2023-01-02 12:00:00 2023-01-02 12:00:00 120     NaN     400     NaN     1
2  2023-01-03 12:00:00 2023-01-02 12:00:00 NaN     140     NaN     500     2
3  2023-01-03 12:00:00 2023-01-03 12:00:00 160     NaN     700     NaN     1
4  2023-01-04 12:00:00 2023-01-03 12:00:00 NaN     200     NaN     300     2
5  2023-01-04 12:00:00 2023-01-04 12:00:00 430     NaN     200     NaN     1

One solution is to use np.where:

import pandas as pd
import numpy as np
df = pd.DataFrame(data=
[["2023-01-02 12:00:00", "2023-01-01 12:00:00", 100,  200],
["2023-01-02 12:00:00", "2023-01-02 12:00:00", 120,  400],
["2023-01-03 12:00:00", "2023-01-02 12:00:00", 140,  500],
["2023-01-03 12:00:00", "2023-01-03 12:00:00", 160,  700],
["2023-01-04 12:00:00", "2023-01-03 12:00:00", 200,  300],
["2023-01-04 12:00:00", "2023-01-04 12:00:00", 430,  200]],
columns=["datetime","day_fetched","col_a","col_b"])

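# days ahead: 1 if both timestamps fall on the same calendar date, otherwise 2
# (normalize() drops the time of day, so only the date part is compared)
same_date = (pd.to_datetime(df["datetime"]).dt.normalize()
             == pd.to_datetime(df["day_fetched"]).dt.normalize())
df["day_ahead"] = np.where(same_date, 1, 2)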
# column of None's for next section
df["na"] = None
# overwrite dataframe with new df
df = pd.DataFrame(data=np.where(df["day_ahead"] == 1,
                                [df["datetime"], df["day_fetched"],
                                 df["col_a"], df["na"],
                                 df["col_b"], df["na"],
                                 df["day_ahead"]],
                                [df["datetime"], df["day_fetched"],
                                 df["na"], df["col_a"],
                                 df["na"], df["col_b"],
                                 df["day_ahead"]]).T,
                  columns=["datetime", "day_fetched",
                           "col_a_1", "col_a_2",
                           "col_b_1", "col_b_2",
                           "day_ahead"])

df
#               datetime          day_fetched col_a_1  ... col_b_1 col_b_2 day_ahead
# 0  2023-01-02 12:00:00  2023-01-01 12:00:00    None  ...    None     200         2
# 1  2023-01-02 12:00:00  2023-01-02 12:00:00     120  ...     400    None         1
# 2  2023-01-03 12:00:00  2023-01-02 12:00:00    None  ...    None     500         2
# 3  2023-01-03 12:00:00  2023-01-03 12:00:00     160  ...     700    None         1
# 4  2023-01-04 12:00:00  2023-01-03 12:00:00    None  ...    None     300         2
# 5  2023-01-04 12:00:00  2023-01-04 12:00:00     430  ...     200    None         1

# [6 rows x 7 columns]
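
As an alternative to the transposed np.where call, the same split can be sketched with Series.where, which keeps a value where a condition holds and fills NaN elsewhere. This is only a minimal sketch, not the method used above; it assumes both timestamp columns parse with pd.to_datetime and that only the calendar date matters for day_ahead. The names `same_date` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[
        ["2023-01-02 12:00:00", "2023-01-01 12:00:00", 100, 200],
        ["2023-01-02 12:00:00", "2023-01-02 12:00:00", 120, 400],
        ["2023-01-03 12:00:00", "2023-01-02 12:00:00", 140, 500],
        ["2023-01-03 12:00:00", "2023-01-03 12:00:00", 160, 700],
        ["2023-01-04 12:00:00", "2023-01-03 12:00:00", 200, 300],
        ["2023-01-04 12:00:00", "2023-01-04 12:00:00", 430, 200],
    ],
    columns=["datetime", "day_fetched", "col_a", "col_b"],
)

# Compare calendar dates only (assumption: time of day is irrelevant).
same_date = (
    pd.to_datetime(df["datetime"]).dt.normalize()
    == pd.to_datetime(df["day_fetched"]).dt.normalize()
)
df["day_ahead"] = np.where(same_date, 1, 2)

# For each value column, keep the value only in the rows matching the flag;
# all other rows become NaN.
for col in ["col_a", "col_b"]:
    df[f"{col}_1"] = df[col].where(df["day_ahead"] == 1)
    df[f"{col}_2"] = df[col].where(df["day_ahead"] == 2)

result = df.drop(columns=["col_a", "col_b"])
print(result)
```

Because Series.where fills with NaN, the split columns come out as floats rather than as object columns holding None, which tends to be easier to work with in later numeric steps.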

When asking a question, please provide data that can be easily copied, for example by using df.to_dict().
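
For illustration, here is a small sketch of what that looks like: df.to_dict() produces a plain dictionary that can be pasted into a question, and anyone answering can rebuild the frame with pd.DataFrame. The two-row frame and the names `as_dict` and `rebuilt` are made up for the example.

```python
import pandas as pd

# A tiny stand-in frame (hypothetical data, just for the demonstration).
df = pd.DataFrame({"col_a": [100, 120], "col_b": [200, 400]})

# Default orientation: {column -> {index -> value}}.
as_dict = df.to_dict()
print(as_dict)

# Whoever copies the printed dictionary can rebuild the frame directly.
rebuilt = pd.DataFrame(as_dict)
print(rebuilt)
```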

EDIT - Generalised for many columns

Here is a (more complicated) bit of code that uses a list comprehension to pivot each col_ column on the value of day_ahead and concatenates the pieces to produce the same result. It needs the DataFrame that still contains col_a, col_b and day_ahead, so run it in place of the overwrite step above rather than after it:

df = pd.concat(
    [df.pivot_table(index=[df.index, "datetime", "day_fetched"],
                    columns=["day_ahead"],
                    values=x).add_prefix(x+"_") for x in \
     df.columns[df.columns.str.startswith("col_")]] + \
        [df.set_index([df.index, "datetime", "day_fetched"])["day_ahead"]],
        axis=1).reset_index(level=[1, 2])

The second, third and fourth lines above create the pivot table and add the column name and "_" as a prefix; they sit inside a list comprehension over every column in df that starts with "col_" (fifth line). The sixth line appends the day_ahead column, indexed the same way as the pivot tables. The seventh line concatenates everything along the columns and resets the index so that datetime and day_fetched become columns again.
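
If the pivot is harder to follow than it needs to be, the generalisation can also be sketched with a dictionary comprehension over the "col_" columns and the distinct day_ahead values. This is a sketch under assumptions rather than a drop-in replacement: it assumes day_ahead already exists and holds small integers, and the names `value_cols`, `levels`, `split` and `result` are illustrative.

```python
import pandas as pd

# Frame as it stands once day_ahead has been computed (first three rows only).
df = pd.DataFrame(
    {
        "datetime": ["2023-01-02 12:00:00", "2023-01-02 12:00:00", "2023-01-03 12:00:00"],
        "day_fetched": ["2023-01-01 12:00:00", "2023-01-02 12:00:00", "2023-01-02 12:00:00"],
        "col_a": [100, 120, 140],
        "col_b": [200, 400, 500],
        "day_ahead": [2, 1, 2],
    }
)

# Every column to split, discovered by prefix as in the pivot version.
value_cols = [c for c in df.columns if c.startswith("col_")]
# Distinct flag values, converted to plain ints so the suffixes read "_1", "_2".
levels = sorted(int(k) for k in df["day_ahead"].unique())

# One output column per (value column, flag value) pair; rows whose flag
# does not match are left as NaN by Series.where.
split = pd.DataFrame(
    {
        f"{col}_{k}": df[col].where(df["day_ahead"] == k)
        for col in value_cols
        for k in levels
    }
)

result = pd.concat([df[["datetime", "day_fetched"]], split, df["day_ahead"]], axis=1)
print(result)
```

Unlike the pivot, this never moves datetime and day_fetched into the index, so no reset_index step is needed at the end.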
