Polars - Replace part of a string in one column with the value of another column

Question

I have a Polars DataFrame that looks like this:

import polars as pl

df = pl.DataFrame(
    {
        "ItemId": [15148, 15148, 24957],
        "SuffixFactor": [19200, 200, 24],
        "ItemRand": [254, -1, -44],
        "Stat0": ['+5 Defense', '+$i Might', '+9 Vitality'],
        "Amount": ['', '7', '']
    }
)

Whenever Stat0 contains $i, I want to replace the $i in the "Stat0" column with the value of Amount.

I have tried a few different approaches, for example:

df = df.with_column(
    pl.col('Stat0').str.replace(r'\$i', pl.col('Amount'))
)

Expected result:

result = pl.DataFrame(
    {
        "ItemId": [15148, 15148, 24957],
        "SuffixFactor": [19200, 200, 24],
        "ItemRand": [254, -1, -44],
        "Stat0": ['+5 Defense', '+7 Might', '+9 Vitality'],
        "Amount": ['', '7', '']
    }
)

But this doesn't seem to work.

I hope someone can help.

Regards

Answer 1

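The problem is that, in the Polars version this answer was written against, the replace method does not accept an expression for the replacement value, only a constant. So we cannot pass a column as the replacement.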

We can work around this in two ways.

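Slow: using `apply`

This method performs the replacement in Python code. Because `apply` runs Python bytecode for every row, it will be slow. If your DataFrame is small, the cost will not be noticeable.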

(
    df
    .with_columns(
        # Pack both columns into a struct so the lambda sees one row at a time.
        pl.struct(['Stat0', 'Amount'])
        .apply(lambda cols: cols['Stat0'].replace('$i', cols['Amount']))
        .alias('Stat0')
    )
)

shape: (3, 5)
┌────────┬──────────────┬──────────┬─────────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Stat0       ┆ Amount │
│ ---    ┆ ---          ┆ ---      ┆ ---         ┆ ---    │
│ i64    ┆ i64          ┆ i64      ┆ str         ┆ str    │
╞════════╪══════════════╪══════════╪═════════════╪════════╡
│ 15148  ┆ 19200        ┆ 254      ┆ +5 Defense  ┆        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ +7 Might    ┆ 7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆ +9 Vitality ┆        │
└────────┴──────────────┴──────────┴─────────────┴────────┘

Fast: using `split_exact` and `when/then/otherwise`

This method uses only Polars expressions, so it will be faster, especially on large DataFrames.

(
    df
    .with_columns(
        # Split on the literal "$i" into a struct of two fields.
        pl.col('Stat0').str.split_exact('$i', 1)
    )
    .unnest('Stat0')
    .with_columns(
        # A null field_1 means "$i" was absent, so keep the text unchanged.
        pl.when(pl.col('field_1').is_null())
        .then(pl.col('field_0'))
        .otherwise(pl.concat_str(['field_0', 'Amount', 'field_1']))
        .alias('Stat0')
    )
    .drop(['field_0', 'field_1'])
)

shape: (3, 5)
┌────────┬──────────────┬──────────┬────────┬─────────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Amount ┆ Stat0       │
│ ---    ┆ ---          ┆ ---      ┆ ---    ┆ ---         │
│ i64    ┆ i64          ┆ i64      ┆ str    ┆ str         │
╞════════╪══════════════╪══════════╪════════╪═════════════╡
│ 15148  ┆ 19200        ┆ 254      ┆        ┆ +5 Defense  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ 7      ┆ +7 Might    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆        ┆ +9 Vitality │
└────────┴──────────────┴──────────┴────────┴─────────────┘

How it works: we first use split_exact to split the Stat0 column on $i. This produces a struct.

(
    df
    .with_columns(
        pl.col('Stat0').str.split_exact('$i', 1)
    )
)

shape: (3, 5)
┌────────┬──────────────┬──────────┬──────────────────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Stat0                ┆ Amount │
│ ---    ┆ ---          ┆ ---      ┆ ---                  ┆ ---    │
│ i64    ┆ i64          ┆ i64      ┆ struct[2]            ┆ str    │
╞════════╪══════════════╪══════════╪══════════════════════╪════════╡
│ 15148  ┆ 19200        ┆ 254      ┆ {"+5 Defense",null}  ┆        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ {"+"," Might"}       ┆ 7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆ {"+9 Vitality",null} ┆        │
└────────┴──────────────┴──────────┴──────────────────────┴────────┘

Note that when Stat0 does not contain $i, the second member of the struct is null. We will use this fact to our advantage.

In the next step, we use unnest to break the struct into separate columns.

(
    df
    .with_columns(
        pl.col('Stat0').str.split_exact('$i', 1)
    )
    .unnest('Stat0')
)

shape: (3, 6)
┌────────┬──────────────┬──────────┬─────────────┬─────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ field_0     ┆ field_1 ┆ Amount │
│ ---    ┆ ---          ┆ ---      ┆ ---         ┆ ---     ┆ ---    │
│ i64    ┆ i64          ┆ i64      ┆ str         ┆ str     ┆ str    │
╞════════╪══════════════╪══════════╪═════════════╪═════════╪════════╡
│ 15148  ┆ 19200        ┆ 254      ┆ +5 Defense  ┆ null    ┆        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ +           ┆  Might  ┆ 7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆ +9 Vitality ┆ null    ┆        │
└────────┴──────────────┴──────────┴─────────────┴─────────┴────────┘

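This creates two new columns: field_0 and field_1.

From here, we use when/then/otherwise and concat_str to build the final result.

Basically:

When $i does not appear in the Stat0 column, the string is not split and field_1 is null, so we can use the value in field_0 as is.
When $i does appear in Stat0, the string is split into two parts: field_0 and field_1. We simply concatenate the parts back together with Amount in the middle.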

(
    df
    .with_columns(
        pl.col('Stat0').str.split_exact('$i', 1)
    )
    .unnest('Stat0')
    .with_columns(
        # field_1 is null exactly when "$i" was not found.
        pl.when(pl.col('field_1').is_null())
        .then(pl.col('field_0'))
        .otherwise(pl.concat_str(['field_0', 'Amount', 'field_1']))
        .alias('Stat0')
    )
)

shape: (3, 7)
┌────────┬──────────────┬──────────┬─────────────┬─────────┬────────┬─────────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ field_0     ┆ field_1 ┆ Amount ┆ Stat0       │
│ ---    ┆ ---          ┆ ---      ┆ ---         ┆ ---     ┆ ---    ┆ ---         │
│ i64    ┆ i64          ┆ i64      ┆ str         ┆ str     ┆ str    ┆ str         │
╞════════╪══════════════╪══════════╪═════════════╪═════════╪════════╪═════════════╡
│ 15148  ┆ 19200        ┆ 254      ┆ +5 Defense  ┆ null    ┆        ┆ +5 Defense  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ +           ┆  Might  ┆ 7      ┆ +7 Might    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆ +9 Vitality ┆ null    ┆        ┆ +9 Vitality │
└────────┴──────────────┴──────────┴─────────────┴─────────┴────────┴─────────────┘

Answer 1 was posted on 2022-08-20 14:17:11.