In a pandas dataframe, how can I set the values of other columns based on one column's data without using a loop?

Question

I'm trying to build a dataframe that will be used for linear regression. I would like to include 11 independent "dummy" variables that are set to either 1 or 0 based on the month of the year. Without getting too far off topic, I'm using 11 variables instead of 12 because the 12th month is captured by the intercept.

I know many things can be done with pandas without looping through the entire dataframe, and doing things that way is typically faster than using a loop.

So, is it possible to grab the month from my date column and dynamically set a separate column to either 1 or 0 based on that month? Or am I asking a stupid question?

Edit: I should have included more information. The dataframe is structured like this:

Date	sku	units ordered	sessions	conversion rate
2020/01/30	abc123	20	200	0.1
2020/01/31	abc123	10	100	0.1
2020/02/01	abc123	15	60	0.25

I would like to make it look like this:

Date	sku	units ordered	sessions	conversion rate	january	february
2020/01/30	abc123	20	200	0.1	1	0
2020/01/31	abc123	10	100	0.1	1	0
2020/02/01	abc123	15	60	0.25	0	1

The code I'm currently using to accomplish this is:

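import calendar

# Create one zero-filled column per month, January through November.
x = 1
while x < 12:
    month = calendar.month_name[x]
    df[month] = 0
    x += 1

# Flag each row's month; December rows stay all zeros (captured by the intercept).
for index, row in df.iterrows():
    d = row.iloc[0]  # the first column holds the date
    month = d.strftime("%B")
    if month != "December":
        df.at[index, month] = 1

# Fill any remaining gaps once, after the loop has finished.
df.fillna(0, inplace=True)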

I'm just not sure if this is the best way to accomplish this.

Answer 1

My approach would be to first get the month number from every date using dt.month:

df['Date'].dt.month

0    1
1    1
2    2
Name: Date, dtype: int64
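
Note that dt.month only works once the column holds datetime values. As a minimal sketch, assuming the dates arrive as strings in the question's YYYY/MM/DD layout (the sample series and variable names here are illustrative, not from the original post), the column can be parsed with pd.to_datetime first:

```python
import pandas as pd

# Hypothetical sample mirroring the question's Date column, stored as strings.
dates = pd.Series(["2020/01/30", "2020/01/31", "2020/02/01"], name="Date")

# Parse to datetime64 first; using .dt on a plain string column raises AttributeError.
parsed = pd.to_datetime(dates, format="%Y/%m/%d")

# Month number (1-12) for every row, computed without an explicit loop.
months = parsed.dt.month
print(months.tolist())  # [1, 1, 2]
```

Depending on the pandas version, the integer dtype of the result may be reported as int32 rather than int64.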

Then use crosstab with the index to tabulate the counts, which gives one row per original row and one column per month:

pd.crosstab(
        df.index,
        df['Date'].dt.month
    )

Date   1  2
row_0      
0      1  0
1      1  0
2      0  1
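
One thing to keep in mind is that crosstab only produces columns for months that actually occur in the data. The question asks for exactly 11 dummy columns with December left to the intercept, so a possible follow-up step, sketched here under that assumption, is to reindex the columns to the month numbers 1 through 11. The variable name dummies is illustrative:

```python
import pandas as pd

# Small frame with the same dates as the question's sample.
df = pd.DataFrame(
    {"Date": pd.to_datetime(["2020-01-30", "2020-01-31", "2020-02-01"])}
)

# One row per original row, one column per month present in the data.
dummies = pd.crosstab(df.index, df["Date"].dt.month)

# Force columns 1..11 (January..November); absent months become all-zero
# columns, and December (12) would be dropped if it were present.
dummies = dummies.reindex(columns=range(1, 12), fill_value=0)
print(dummies.shape)  # (3, 11)
```

With this in place the regression always sees the same set of columns, regardless of which months a given extract happens to contain.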

Then merge the result back into the dataframe on the index:

df = (
    df.merge(pd.crosstab(
        df.index,
        df['Date'].dt.month
    ),
        left_index=True,
        right_index=True)
)

Output:

        Date     sku  units ordered  sessions  conversion rate  1  2
0 2020-01-30  abc123             20       200             0.10  1  0
1 2020-01-31  abc123             10       100             0.10  1  0
2 2020-02-01  abc123             15        60             0.25  0  1

Finally, rename the columns using a mapper generated with the calendar module:

df = df.rename(columns={month_num: calendar.month_name[month_num]
                        for month_num in range(1, 13)})

All together:

import pandas as pd
import calendar

df = pd.DataFrame(
    {'Date': {0: '2020/01/30', 1: '2020/01/31', 2: '2020/02/01'},
     'sku': {0: 'abc123', 1: 'abc123', 2: 'abc123'},
     'units ordered': {0: 20, 1: 10, 2: 15},
     'sessions': {0: 200, 1: 100, 2: 60},
     'conversion rate': {0: 0.1, 1: 0.1, 2: 0.25}})
df['Date'] = df['Date'].astype('datetime64[ns]')

df = (
    df.merge(pd.crosstab(
        df.index,
        df['Date'].dt.month
    ),
        left_index=True,
        right_index=True)
)

df = df.rename(columns={month_num: calendar.month_name[month_num]
                        for month_num in range(1, 13)})

print(df.to_string())

Output:

        Date     sku  units ordered  sessions  conversion rate  January  February
0 2020-01-30  abc123             20       200             0.10        1         0
1 2020-01-31  abc123             10       100             0.10        1         0
2 2020-02-01  abc123             15        60             0.25        0         1
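
As an alternative to crosstab plus rename, and not the method used in this answer, the same dummy columns can probably be built with pd.get_dummies on the month names. The sketch below assumes an English locale for both dt.month_name() and calendar.month_name, and it drops December to match the 11-variable setup described in the question. The names month_cols, dummies, and result are illustrative:

```python
import calendar

import pandas as pd

df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2020-01-30", "2020-01-31", "2020-02-01"]),
        "units ordered": [20, 10, 15],
    }
)

# January..November; December is left out as the baseline month.
month_cols = list(calendar.month_name[1:12])

dummies = (
    pd.get_dummies(df["Date"].dt.month_name())   # one indicator column per month present
    .reindex(columns=month_cols, fill_value=0)   # add missing months, fix column order
    .astype(int)                                 # 1/0 instead of True/False
)

# Attach the dummy columns to the original frame by index.
result = df.join(dummies)
print(result.columns.tolist())
```

Here the reindex step both orders the columns by calendar month and guarantees that all 11 exist, even when the data covers only part of the year.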


Answer 1
0 Accepted 2021-04-27 20:45:28
