In a pandas dataframe, how can I set the value of other columns based on the data from one column, without using a loop?
I'm trying to build a dataframe that will be used for linear regression. I would like to include 11 independent "dummy" variables that are set to either 1 or 0 based on the month of the year. Without getting too far off topic, I'm using 11 variables instead of 12 because the 12th month is captured by the intercept.
I know many things can be done with pandas without looping through the entire dataframe, and that doing things that way is typically faster than using a loop.
So, is it possible to grab the month from my date column and dynamically set a separate column to either 1 or 0 based on that month? Or am I asking a stupid question?
Edit: I should have included more information. The dataframe is structured like this:
| Date | sku | units ordered | sessions | conversion rate |
|---|---|---|---|---|
| 2020/01/30 | abc123 | 20 | 200 | 0.1 |
| 2020/01/31 | abc123 | 10 | 100 | 0.1 |
| 2020/02/01 | abc123 | 15 | 60 | 0.25 |
I would like to make it look like this:
| Date | sku | units ordered | sessions | conversion rate | january | february |
|---|---|---|---|---|---|---|
| 2020/01/30 | abc123 | 20 | 200 | 0.1 | 1 | 0 |
| 2020/01/31 | abc123 | 10 | 100 | 0.1 | 1 | 0 |
| 2020/02/01 | abc123 | 15 | 60 | 0.25 | 0 | 1 |
The code I'm currently using to accomplish this is:
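import calendar

# Create one zero-filled column per month, January through November.
x = 1
while x < 12:
    month = calendar.month_name[x]
    df[month] = 0
    x += 1

# Walk every row and flag the column matching that row's month.
for index, row in df.iterrows():
    d = row["Date"]
    month = d.strftime("%B")
    if not month == "December":
        df.at[index, month] = 1

df.fillna(0, inplace=True)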
I'm just not sure if this is the best way to accomplish this.
My approach would be to first get the month number from every date using dt.month:
df['Date'].dt.month
0    1
1    1
2    2
Name: Date, dtype: int64
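Note that the dt accessor only works on a datetime-like column. If Date is still stored as strings such as 2020/01/30 (an assumption based on the table in the question), it has to be parsed first. A minimal sketch, using a made-up three-row frame:

```python
import pandas as pd

# Hypothetical sample mirroring the question's Date column, stored as strings.
df = pd.DataFrame({"Date": ["2020/01/30", "2020/01/31", "2020/02/01"]})

# Parse the strings into datetimes; the explicit format avoids guessing.
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")

# With a datetime column, .dt.month yields the month number for each row.
months = df["Date"].dt.month
print(months.tolist())
```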
Then use crosstab with the index to get the tabulation of the counts:
pd.crosstab(
    df.index,
    df['Date'].dt.month
)
Date   1  2
row_0
0      1  0
1      1  0
2      0  1
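One thing to keep in mind: crosstab only produces columns for months that actually occur in the data, and it would produce a December column if December rows were present. Since the question asks for exactly 11 columns (January through November), one possible way to get there is to reindex the crosstab result. This is a sketch on made-up dates, not part of the original approach:

```python
import pandas as pd

# Toy data: no rows for March through November, one row in December.
dates = pd.Series(pd.to_datetime(
    ["2020/01/30", "2020/01/31", "2020/02/01", "2020/12/15"],
    format="%Y/%m/%d",
))

counts = pd.crosstab(dates.index, dates.dt.month)

# Keep exactly months 1..11: absent months become all-zero columns,
# and the December column (12) is dropped.
dummies = counts.reindex(columns=range(1, 12), fill_value=0)
print(dummies.columns.tolist())
```

The December row ends up with zeros in every column, which matches the idea that the 12th month is captured by the intercept.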
Then merge back into the DataFrame on the index:
df = df.merge(
    pd.crosstab(df.index, df['Date'].dt.month),
    left_index=True,
    right_index=True
)
Output:
        Date     sku  units ordered  sessions  conversion rate  1  2
0 2020-01-30  abc123             20       200             0.10  1  0
1 2020-01-31  abc123             10       100             0.10  1  0
2 2020-02-01  abc123             15        60             0.25  0  1
Finally, rename the columns using a mapper generated with the calendar API:
df = df.rename(columns={month_num: calendar.month_name[month_num]
                        for month_num in range(1, 13)})
All together:
import pandas as pd
import calendar

df = pd.DataFrame(
    {'Date': {0: '2020/01/30', 1: '2020/01/31', 2: '2020/02/01'},
     'sku': {0: 'abc123', 1: 'abc123', 2: 'abc123'},
     'units ordered': {0: 20, 1: 10, 2: 15},
     'sessions': {0: 200, 1: 100, 2: 60},
     'conversion rate': {0: 0.1, 1: 0.1, 2: 0.25}})
df['Date'] = df['Date'].astype('datetime64[ns]')

df = df.merge(
    pd.crosstab(df.index, df['Date'].dt.month),
    left_index=True,
    right_index=True
)
df = df.rename(columns={month_num: calendar.month_name[month_num]
                        for month_num in range(1, 13)})
print(df.to_string())
Output:
        Date     sku  units ordered  sessions  conversion rate  January  February
0 2020-01-30  abc123             20       200             0.10        1         0
1 2020-01-31  abc123             10       100             0.10        1         0
2 2020-02-01  abc123             15        60             0.25        0         1
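As an alternative to crosstab plus merge, pd.get_dummies can build the indicator columns directly. The sketch below swaps in that technique and is not the method used above. It assumes a trimmed-down, made-up version of the question's frame, and declares all 12 months as categories so that every month from January through November gets a column even when it is missing from the data:

```python
import calendar
import pandas as pd

# Hypothetical, shortened version of the question's dataframe.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020/01/30", "2020/01/31", "2020/02/01"], format="%Y/%m/%d"
    ),
    "units ordered": [20, 10, 15],
})

# Declare all 12 months as categories so get_dummies emits a column
# for each of them, even months that never occur in the data.
all_months = pd.CategoricalDtype(categories=list(range(1, 13)))
month = df["Date"].dt.month.astype(all_months)

# One 0/1 column per month; drop December (12), which the intercept captures.
dummies = pd.get_dummies(month, dtype=int).drop(columns=12)
dummies.columns = [calendar.month_name[int(m)] for m in dummies.columns]

df = df.join(dummies)
print(df.to_string())
```

Like the crosstab version, this involves no explicit loop over rows. The month names come from the calendar module, so they follow the current locale, which is English by default.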