繁体   English   中英

Pandas 根据列值删除和更新行和列

[英]Pandas drop and update rows and columns based on column value

这是板球得分的示例 csv 文件:

>>> df
 venue  ball  run extra wide  noball
0   a   0.1     0   1   NaN     NaN
1   a   0.2     4   0   NaN     NaN
2   a   0.3     1   5   5.0     NaN
3   a   0.4     1   0   NaN     NaN
4   a   0.5     1   1   NaN     1.0
5   a   0.6     2   1   NaN     NaN
6   a   0.7     6   2   1.0     1.0
7   a   0.8     0   0   NaN     NaN
8   a   0.9     1   1   NaN     NaN
9   a   1.1     2   2   NaN     NaN
10  a   1.2     1   0   NaN     NaN
11  a   1.3     6   1   NaN     NaN
12  a   1.4     0   2   NaN     2.0
13  a   1.5     1   0   NaN     NaN
14  a   1.6     2   0   NaN     NaN
15  a   1.7     0   1   NaN     NaN
16  a   0.1     0   5   NaN     NaN
17  a   0.2     4   0   NaN     NaN
18  a   0.3     1   1   NaN     NaN
19  a   0.4     3   0   NaN     NaN
20  a   0.5     0   0   NaN     NaN
21  a   0.6     0   2   2.0     NaN
22  a   0.7     6   1   NaN     NaN
23  a   1.1     4   0   NaN     NaN

从这个数据框中,我想更新球值,生成 2 个新列并删除 4 个整列。 条件是

  1. 当“wide”或“noball”为空时,crun = crun + run + extra until ball = 0.1(递归)
  2. 当“wide”或“noball”不为空时,并发球值不会增加,在crun计算后会被丢弃。 crun = crun + 运行 + 额外。 它会一直持续到球 = 0.1(递归),例如。 让我细分:从行索引 0 到 8: | 0.1 "wide" 或 "noball" 为空且 crun = 1 | 0.2 "wide" 或 "noball" 为空且 crun = 1+4 =5| 0.3 "wide" 或 "noball" 不为空(已删除) | 0.4 "wide" 或 "noball" 为空(变为 0.3)并且 crun = 5+1+5+1 = 12| 0.5 "wide" or "noball" 不为空(已移除) | 0.6 "wide" 或 "noball" 为空(变为 0.4)且 crun = 12+1+1+2+1 =17| 0.7 "wide" 或 "noball" 不为空(已删除) | 0.8 "wide" 或 "noball" 为空(变为 0.5)并且 crun = 17+6+2 = 25| 0.9 "wide" 或 "noball" 为空(变为 0.6)并且 crun = 25+1+1 =27|
  3. 最后将创建“total”列,它返回 crun 的最大值,直到 ball = 0.1(递归)。 然后应该删除“run”、“extra”、“wide”、“noball”列。

我想要的输出:

  venue ball  crun total
0   a   0.1     1   45
1   a   0.2     5   45
2   a   0.3     12  45
3   a   0.4     17  45
4   a   0.5     25  45
5   a   0.6     27  45
6   a   1.1     31  45
7   a   1.2     32  45
8   a   1.3     39  45
9   a   1.4     42  45
10  a   1.5     44  45
11  a   1.6     45  45
12  a   0.1     5   27
13  a   0.2     9   27
14  a   0.3     11  27
15  a   0.4     14  27
16  a   0.5     14  27
17  a   0.6     23  27
18  a   1.1     27  27

我觉得太复杂了,请帮忙。 我试过的代码:

df = pd.read_csv("data.csv")
gr = df.groupby(df.ball.eq(0.1).cumsum())
df["crun"] = gr.runs.cumsum()
df["total"] = gr.current_run.transform("max")
df = df.drop(['run', 'extra', 'wide', 'noball'], axis=1)

好吧。 这是一个有趣的。

(为了清楚起见,我尝试添加注释。)

注意:“ball”、“run”、“extra”、“wide”和“noball”都是数字字段。

注意注意:这一切都假设您的初始 DataFrame 位于名为df的变量下。

    # Create target groupings by ball value.
    df["target_groups"] = df.loc[df["ball"] == 0.1].groupby(level=-1).ngroup()
    df["target_groups"].fillna(method="ffill", inplace=True)
    

    # --- Create subgroups --- #
    df["target_subgroups"] = df["ball"].astype(int)


    # Add field fro sum of run and extra
    df["run_extra"] = df[["run", "extra"]].sum(axis=1)
    
    # Apply groupby() and cumsum() as follows to get the cumulative sum
    # of each ball group for run and extra.
    df["crun"] = df.groupby(["target_groups"])["run_extra"].cumsum()
    
    # Create dataframe for max crun value of each group
    group_max_df = df.groupby(["target_groups"])["crun"].max().to_frame().reset_index()
    
    # Merge both of the DataFrames with the given suffixes.  The first one
    # Just prevents crun from having a suffix added, which is an additional
    # step to remove.
    # You could probably use .join() in a similar manner.
    df = pd.merge(df, group_max_df,
        on=["target_groups"],
        suffixes=("", "_total"),
        sort=False
    )
    # Rename your new total field.
    df.rename(columns={"crun_total": "total"}, inplace = True)
    
    # Apply your wide and noball condition here.
    df = df[(df["wide"].isna()) & (df["noball"].isna())].copy()
    

    # -- Reset `ball` column -- #
    # Add temp column with static value
    df["tmp_ball"] = 0.1

    # Generate cumulative sum by subgroup.
    # Set `ball` to modulo 0.6
    df.loc[:, "ball"] = df.groupby(["target_subgroups"])["tmp_ball"].cumsum() % 0.6

    # Find rows where ball == 0.0 and set those to 0.6
    df.loc[df["ball"] == 0.0, "ball"] = 0.6

    # Add ball and target_subgroups columns to get final ball value.
    df["ball"]  = df["ball"] + df["target_subgroups"]


    # Reset your main index, if desired
    df.reset_index(drop=True, inplace=True)


    # Select only desired field for output.
    df = df.loc[:, ["venue","ball","crun","total"]].copy()

df输出:

   venue  ball  crun  total
0      a   0.1     1     45
1      a   0.2     5     45
2      a   0.4    12     45
3      a   0.6    17     45
4      a   0.8    25     45
5      a   0.9    27     45
6      a   1.1    31     45
7      a   1.2    32     45
8      a   1.3    39     45
9      a   1.5    42     45
10     a   1.6    44     45
11     a   1.7    45     45
12     a   0.1     5     27
13     a   0.2     9     27
14     a   0.3    11     27
15     a   0.4    14     27
16     a   0.5    14     27
17     a   0.7    23     27
18     a   1.1    27     27

编辑:

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM