When implementing a binary flag column in Python polars (v0.15.15), I came across some seemingly weird behavior. Given a DataFrame:
import polars as pl
df = pl.DataFrame({
"col1": [0,1,2,3],
"flag": [0,0,0,0]
})
I set a flag by OR-ing it into the current flag value, e.g. 2:
df = df.with_column(
pl.when((pl.col("col1") < 1) | (pl.col("col1") >= 3))
.then(pl.col("flag") | 2) # set flag b0010
.otherwise(pl.col("flag"))
)
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 0 ┆ 2 │
│ 1 ┆ 0 │
│ 2 ┆ 0 │
│ 3 ┆ 2 │
└──────┴──────┘
So far so good. However, when adding another flag, I get something unexpected:
df = df.with_column(
pl.when(pl.col("col1") > -1)
.then(pl.col("flag") | 4) # also set flag b0100
.otherwise(pl.col("flag"))
)
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 0 ┆ 6 │
│ 1 ┆ 6 │ # <-- ?! 0 | 4 is 4, not 6
│ 2 ┆ 6 │ # <-- ?! 0 | 4 is 4, not 6
│ 3 ┆ 6 │
└──────┴──────┘
Why are all flags now 6? I'd expect [6, 4, 4, 6]
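For reference, the expected values follow from plain Python integer arithmetic. This is a minimal sketch without polars; set_flag is a hypothetical helper introduced only for illustration:

```python
def set_flag(flags, mask, bit):
    """Return a new list with `bit` OR-ed into flags[i] wherever mask[i] is True."""
    return [f | bit if m else f for f, m in zip(flags, mask)]

col1 = [0, 1, 2, 3]
flags = [0, 0, 0, 0]

# Step 1: set bit 0b0010 where col1 < 1 or col1 >= 3
flags = set_flag(flags, [(c < 1) or (c >= 3) for c in col1], 2)
# Step 2: set bit 0b0100 where col1 > -1 (true for every row)
flags = set_flag(flags, [c > -1 for c in col1], 4)

print(flags)  # [6, 4, 4, 6]
```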
Doing it the other way around (set flag 4, then flag 2), the result is as expected:
df = pl.DataFrame({"col1": [0,1,2,3], "flag": [0,0,0,0]})
df = df.with_column(
pl.when(pl.col("col1") > -1)
.then(pl.col("flag") | 4)
.otherwise(pl.col("flag"))
)
df = df.with_column(
pl.when((pl.col("col1") < 1) | (pl.col("col1") >= 3))
.then(pl.col("flag") | 2)
.otherwise(pl.col("flag"))
)
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 0 ┆ 6 │
│ 1 ┆ 4 │
│ 2 ┆ 4 │
│ 3 ┆ 6 │
└──────┴──────┘
What's going on here, what am I missing?
Series-wide bitwise operations like OR (|) do not seem to be implemented yet; issue submitted on GitHub.
Work-arounds would be, for example, an apply (rather inefficient):
import polars as pl
df = pl.DataFrame({"col1": [0,1,2,3], "flag": [0,0,0,0]})
df = df.with_columns(
pl.when((pl.col("col1") < 1) | (pl.col("col1") >= 3))
.then(pl.col("flag").apply(lambda flag: flag | 2)) # set flag b0010
.otherwise(pl.col("flag"))
)
df = df.with_columns(
pl.when(pl.col("col1") > -1)
.then(pl.col("flag").apply(lambda flag: flag | 4)) # set/combine with flag b0100
.otherwise(pl.col("flag"))
)
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 0 ┆ 6 │
│ 1 ┆ 4 │
│ 2 ┆ 4 │
│ 3 ┆ 6 │
└──────┴──────┘
Or similarly np.bitwise_or (thanks @jqurious), with numpy imported as np; condition_for_flag and flag_to_set are placeholders:
df.with_columns(
pl.when(condition_for_flag)
.then(np.bitwise_or(pl.col("flag"), flag_to_set))
.otherwise(pl.col("flag"))
)
Or use np.where instead of polars' when-then-otherwise, then convert the result back to a Series:
df.with_columns(
pl.Series(
np.where(condition_for_flag,
df["flag"].to_numpy() | flag_to_set,
df["flag"]
)
).alias("flag")
)
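To make the np.where variant concrete, here is a minimal numpy-only sketch using the example data from the question. It assumes the columns have already been pulled out as numpy arrays (e.g. via to_numpy()); the variable names are illustrative:

```python
import numpy as np

col1 = np.array([0, 1, 2, 3])
flag = np.zeros(4, dtype=np.int64)

# Set bit 0b0010 where col1 < 1 or col1 >= 3
flag = np.where((col1 < 1) | (col1 >= 3), flag | 2, flag)
# Set bit 0b0100 where col1 > -1 (true for every row)
flag = np.where(col1 > -1, np.bitwise_or(flag, 4), flag)

print(flag.tolist())  # [6, 4, 4, 6]
```

On numpy integer arrays, the | operator and np.bitwise_or are the same element-wise operation, so either spelling works here.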
Both np.bitwise_or and np.where seem to be more efficient than the apply. While apply most likely has linear time complexity, np.bitwise_or and np.where might perform differently depending on input size. If in doubt, test with your specific (typical) input size.
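A rough way to run such a test is sketched below. Note the assumptions: it does not time polars itself; the per-element Python loop is only a stand-in for what an apply with a lambda does, and or_python / or_numpy are hypothetical helper names. Swap in your own expressions and sizes as needed:

```python
import timeit
import numpy as np

def or_python(flags, bit):
    # One Python-level operation per element, similar in spirit to apply(lambda ...)
    return [f | bit for f in flags]

def or_numpy(flags, bit):
    # Single vectorized call over the whole array
    return np.bitwise_or(flags, bit)

for n in (1_000, 100_000):
    arr = np.zeros(n, dtype=np.int64)
    lst = arr.tolist()
    # Both variants must agree before their timings are worth comparing
    assert or_python(lst, 4) == or_numpy(arr, 4).tolist()
    t_py = timeit.timeit(lambda: or_python(lst, 4), number=20)
    t_np = timeit.timeit(lambda: or_numpy(arr, 4), number=20)
    print(f"n={n}: python loop {t_py:.4f}s, numpy {t_np:.4f}s")
```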