
Can I use newly created variables in the following expressions in `polars`?

In R (and in particular in dplyr::mutate()), I'm used to using newly created variables in subsequent expressions, like so:

library(dplyr, warn.conflicts = FALSE)

head(iris) |> 
  mutate(
    sp1 = Sepal.Length + 1,
    sp2 = sp1 + 1
  )
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species sp1 sp2
#> 1          5.1         3.5          1.4         0.2  setosa 6.1 7.1
#> 2          4.9         3.0          1.4         0.2  setosa 5.9 6.9
#> 3          4.7         3.2          1.3         0.2  setosa 5.7 6.7
#> 4          4.6         3.1          1.5         0.2  setosa 5.6 6.6
#> 5          5.0         3.6          1.4         0.2  setosa 6.0 7.0
#> 6          5.4         3.9          1.7         0.4  setosa 6.4 7.4

I'm now trying to learn polars, and it seems I can't reproduce this behaviour (I'm using the Python version here to stay as close as possible to the source, since the R version is not very complete yet):

import polars as pl

df = pl.DataFrame({"nrs": [1, 2, 3, None, 5]})

mod = df.with_columns(
    (pl.col("nrs") + 1).alias("nrs+1"),
    (pl.col("nrs+1") + 1).alias("nrs+2")
)
Traceback (most recent call last):
  File "<PATH>", line 6, in <module>
    mod = df.with_columns(
          ^^^^^^^^^^^^^^^^
  File "<PATH>", line 7260, in with_columns
    .collect(no_optimization=True)
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<PATH>", line 1501, in collect     
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^
exceptions.ColumnNotFoundError: nrs+1

pip show polars:

Name: polars
Version: 0.18.0

Is this feature unavailable in polars, or am I missing something?

You need multiple .with_columns calls:

df = pl.DataFrame({"nrs": [1, 2, 3, None, 5]})

(df.with_columns((pl.col("nrs") + 1).alias("nrs+1"))
   .with_columns((pl.col("nrs+1") + 1).alias("nrs+2"))
)
shape: (5, 3)
┌──────┬───────┬───────┐
│ nrs  ┆ nrs+1 ┆ nrs+2 │
│ ---  ┆ ---   ┆ ---   │
│ i64  ┆ i64   ┆ i64   │
╞══════╪═══════╪═══════╡
│ 1    ┆ 2     ┆ 3     │
│ 2    ┆ 3     ┆ 4     │
│ 3    ┆ 4     ┆ 5     │
│ null ┆ null  ┆ null  │
│ 5    ┆ 6     ┆ 7     │
└──────┴───────┴───────┘

Perhaps relevant: https://github.com/pola-rs/polars/issues/9062

In comparison to dplyr::mutate, polars runs every expression within a single with_columns in parallel, while mutate evaluates everything sequentially. Since dplyr handles each column in turn, by the time it reaches your second column definition the first one has already been created, so it just works. With polars, by contrast, each column/expression is dispatched at the same time to its own thread/process (I'm not sure which), so the second one doesn't know anything about the first. This is why, with polars, you have to execute it as two with_columns calls.

For example, in dplyr, doing:

head(iris) |> 
  mutate(
    sp1 = Sepal.Length + 1,
    sp2 = sp1 + 1
  )

is the same as doing:

head(iris) |> 
  mutate(
    sp1 = Sepal.Length + 1
  ) |> 
  mutate(
    sp2 = sp1 + 1
  )

From a quantity-of-code perspective, the polars way may be more cumbersome, and with smaller data that may be the most important consideration. You can monkey patch in a mutate function that gives you the dplyr method of doing everything sequentially, so the syntax you're used to remains, although you're giving up parallelism in the process.

The monkey patch is this:

def mutate(self, *args, **kwargs):
    # Apply each expression in its own with_columns so later
    # expressions can reference columns created by earlier ones.
    lazydf = self.lazy()
    for value in args:
        lazydf = lazydf.with_columns(value)
    for key, value in kwargs.items():
        lazydf = lazydf.with_columns(**{key: value})
    return lazydf.collect()

pl.DataFrame.mutate = mutate
del mutate

That allows you to do:

df.mutate(
    (pl.col("nrs") + 1).alias("nrs+1"),
    (pl.col("nrs+1") + 1).alias("nrs+2"),
    **{"nrs+3": pl.col("nrs+2") + 1},
    nrs4=pl.col("nrs+3") + 1,
)

shape: (5, 5)
┌──────┬───────┬───────┬───────┬──────┐
│ nrs  ┆ nrs+1 ┆ nrs+2 ┆ nrs+3 ┆ nrs4 │
│ ---  ┆ ---   ┆ ---   ┆ ---   ┆ ---  │
│ i64  ┆ i64   ┆ i64   ┆ i64   ┆ i64  │
╞══════╪═══════╪═══════╪═══════╪══════╡
│ 1    ┆ 2     ┆ 3     ┆ 4     ┆ 5    │
│ 2    ┆ 3     ┆ 4     ┆ 5     ┆ 6    │
│ 3    ┆ 4     ┆ 5     ┆ 6     ┆ 7    │
│ null ┆ null  ┆ null  ┆ null  ┆ null │
│ 5    ┆ 6     ┆ 7     ┆ 8     ┆ 9    │
└──────┴───────┴───────┴───────┴──────┘
