Create new column by comparing existing column in a DataFrame
I have the following DataFrame:
datetime day_fetched col_a col_b
0 2023-01-02 12:00:00 2023-01-01 12:00:00 100 200
1 2023-01-02 12:00:00 2023-01-02 12:00:00 120 400
2 2023-01-03 12:00:00 2023-01-02 12:00:00 140 500
3 2023-01-03 12:00:00 2023-01-03 12:00:00 160 700
4 2023-01-04 12:00:00 2023-01-03 12:00:00 200 300
5 2023-01-04 12:00:00 2023-01-04 12:00:00 430 200
I want to create a new column that takes the value 2 if the date differs between datetime and day_fetched, and the value 1 if there is no difference.
So my new DataFrame should look like this:
datetime day_fetched col_a col_b day_ahead
0 2023-01-02 12:00:00 2023-01-01 12:00:00 100 200 2
1 2023-01-02 12:00:00 2023-01-02 12:00:00 120 400 1
2 2023-01-03 12:00:00 2023-01-02 12:00:00 140 500 2
3 2023-01-03 12:00:00 2023-01-03 12:00:00 160 700 1
4 2023-01-04 12:00:00 2023-01-03 12:00:00 200 300 2
5 2023-01-04 12:00:00 2023-01-04 12:00:00 430 200 1
Then, based on the day_ahead column, I want to split col_a and col_b into col_a_1 and col_a_2, and col_b_1 and col_b_2.
So the final DataFrame will look like this:
datetime day_fetched col_a_1 col_a_2 col_b_1 col_b_2 day_ahead
0 2023-01-02 12:00:00 2023-01-01 12:00:00 NaN 100 NaN 200 2
1 2023-01-02 12:00:00 2023-01-02 12:00:00 120 NaN 400 NaN 1
2 2023-01-03 12:00:00 2023-01-02 12:00:00 NaN 140 NaN 500 2
3 2023-01-03 12:00:00 2023-01-03 12:00:00 160 NaN 700 NaN 1
4 2023-01-04 12:00:00 2023-01-03 12:00:00 NaN 200 NaN 300 2
5 2023-01-04 12:00:00 2023-01-04 12:00:00 430 NaN 200 NaN 1
One solution is to use np.where:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    data=[["2023-01-02 12:00:00", "2023-01-01 12:00:00", 100, 200],
          ["2023-01-02 12:00:00", "2023-01-02 12:00:00", 120, 400],
          ["2023-01-03 12:00:00", "2023-01-02 12:00:00", 140, 500],
          ["2023-01-03 12:00:00", "2023-01-03 12:00:00", 160, 700],
          ["2023-01-04 12:00:00", "2023-01-03 12:00:00", 200, 300],
          ["2023-01-04 12:00:00", "2023-01-04 12:00:00", 430, 200]],
    columns=["datetime", "day_fetched", "col_a", "col_b"])
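# days ahead: 1 if both timestamps fall on the same calendar date, otherwise 2
same_date = (pd.to_datetime(df["datetime"]).dt.normalize()
             == pd.to_datetime(df["day_fetched"]).dt.normalize())
df["day_ahead"] = np.where(same_date, 1, 2)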
# column of None's for next section
df["na"] = None
# overwrite dataframe with new df
df = pd.DataFrame(data=np.where(df["day_ahead"] == 1,
                                [df["datetime"], df["day_fetched"],
                                 df["col_a"], df["na"],
                                 df["col_b"], df["na"],
                                 df["day_ahead"]],
                                [df["datetime"], df["day_fetched"],
                                 df["na"], df["col_a"],
                                 df["na"], df["col_b"],
                                 df["day_ahead"]]).T,
                  columns=["datetime", "day_fetched",
                           "col_a_1", "col_a_2",
                           "col_b_1", "col_b_2",
                           "day_ahead"])
df
# datetime day_fetched col_a_1 ... col_b_1 col_b_2 day_ahead
# 0 2023-01-02 12:00:00 2023-01-01 12:00:00 None ... None 200 2
# 1 2023-01-02 12:00:00 2023-01-02 12:00:00 120 ... 400 None 1
# 2 2023-01-03 12:00:00 2023-01-02 12:00:00 None ... None 500 2
# 3 2023-01-03 12:00:00 2023-01-03 12:00:00 160 ... 700 None 1
# 4 2023-01-04 12:00:00 2023-01-03 12:00:00 None ... None 300 2
# 5 2023-01-04 12:00:00 2023-01-04 12:00:00 430 ... 200 None 1
# [6 rows x 7 columns]
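The np.where approach above fills the gaps with None, which leaves the new columns with object dtype. As an alternative, here is a minimal sketch that uses Series.where instead, so the gaps become NaN as in the question's expected output. It is not the answer's method; the names same and out are only illustrative, and the data is a shortened copy of the question's rows.

```python
import pandas as pd

# Shortened copy of the question's data (first three rows only)
df = pd.DataFrame(
    data=[["2023-01-02 12:00:00", "2023-01-01 12:00:00", 100, 200],
          ["2023-01-02 12:00:00", "2023-01-02 12:00:00", 120, 400],
          ["2023-01-03 12:00:00", "2023-01-02 12:00:00", 140, 500]],
    columns=["datetime", "day_fetched", "col_a", "col_b"])

# 1 when both timestamp strings are equal, otherwise 2
same = df["datetime"] == df["day_fetched"]
df["day_ahead"] = same.map({True: 1, False: 2})

# Build the split columns: Series.where keeps a value where the
# condition holds and puts NaN everywhere else
out = df[["datetime", "day_fetched"]].copy()
for col in ["col_a", "col_b"]:
    out[col + "_1"] = df[col].where(df["day_ahead"] == 1)
    out[col + "_2"] = df[col].where(df["day_ahead"] == 2)
out["day_ahead"] = df["day_ahead"]

print(out)
```

Because NaN is a float, the split columns come out as float64 here rather than as integers.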
When asking a question, please provide data that can be easily copied, such as by using df.to_dict().
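As a rough illustration of that advice, the sketch below shows what df.to_dict() produces and how a reader could paste the result back into pd.DataFrame. The two-row frame is a shortened copy of the question's data, and the names as_dict and rebuilt are only illustrative.

```python
import pandas as pd

# Shortened copy of the question's data (first two rows only)
df = pd.DataFrame({
    "datetime": ["2023-01-02 12:00:00", "2023-01-02 12:00:00"],
    "day_fetched": ["2023-01-01 12:00:00", "2023-01-02 12:00:00"],
    "col_a": [100, 120],
    "col_b": [200, 400],
})

# Default orientation: {column name: {index label: value}}
as_dict = df.to_dict()
print(as_dict)

# Anyone answering can rebuild the same frame from the printed dict
rebuilt = pd.DataFrame(as_dict)
print(rebuilt.equals(df))
```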
Here is a (more complicated) bit of code that uses a list comprehension to pivot each col_ column on the value of day_ahead and concatenates the pieces to produce the same result. It expects df as it is after day_ahead has been added, before the overwrite above:
df = pd.concat(
    [df.pivot_table(index=[df.index, "datetime", "day_fetched"],
                    columns=["day_ahead"],
                    values=x).add_prefix(x+"_") for x in \
     df.columns[df.columns.str.startswith("col_")]] + \
    [df.set_index([df.index, "datetime", "day_fetched"])["day_ahead"]],
    axis=1).reset_index(level=[1, 2])
The second, third and fourth lines above create the pivot table and add the column name and "_" as a prefix; this is done in a list comprehension over each column in df that starts with "col_" (fifth line). The sixth line adds the day_ahead column at the end of the DataFrame. The seventh line concatenates everything along the columns and resets the index so that datetime and day_fetched are columns again.
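To make the pivot step easier to follow, here is a minimal sketch of it for a single column. It simplifies the code above by first turning the row number into an ordinary column (named row here, which is an assumption for illustration) so that the pivot index is a plain list of column names; tmp and wide are likewise illustrative names.

```python
import pandas as pd

# Shortened copy of the question's data with day_ahead already computed
df = pd.DataFrame({
    "datetime": ["2023-01-02 12:00:00", "2023-01-02 12:00:00"],
    "day_fetched": ["2023-01-01 12:00:00", "2023-01-02 12:00:00"],
    "col_a": [100, 120],
    "day_ahead": [2, 1],
})

# Keep the row number as a column so every original row stays distinct
tmp = df.rename_axis("row").reset_index()

# One output column per day_ahead value (1 and 2), then prefix the
# column labels with the source column name: col_a_1, col_a_2
wide = tmp.pivot_table(index=["row", "datetime", "day_fetched"],
                       columns="day_ahead",
                       values="col_a").add_prefix("col_a_")

# Move datetime and day_fetched back from the index into columns
print(wide.reset_index(level=["datetime", "day_fetched"]))
```

Each row has a value in exactly one of col_a_1 and col_a_2, and pivot_table fills the other with NaN. Its default aggregation is the mean, which is why the values come out as floats.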