How to properly set binary flags in a Python polars dataframe

Question

When implementing a binary flag column in Python polars (v0.15.15), I came across some seemingly weird behavior. Given a DataFrame

import polars as pl

df = pl.DataFrame({
        "col1": [0,1,2,3],
        "flag": [0,0,0,0]
    })

I set a flag by OR-ing the current flag value with the bit to set, e.g. 2:

df = df.with_column(
        pl.when((pl.col("col1") < 1) | (pl.col("col1") >= 3))
        .then(pl.col("flag") | 2) # set flag b0010
        .otherwise(pl.col("flag"))
    )
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 0    ┆ 2    │
│ 1    ┆ 0    │
│ 2    ┆ 0    │
│ 3    ┆ 2    │
└──────┴──────┘

So far so good. However, when adding another flag, I get something unexpected:

df = df.with_column(
        pl.when(pl.col("col1") > -1)  
        .then(pl.col("flag") | 4) # also set flag b0100
        .otherwise(pl.col("flag"))
    )
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 0    ┆ 6    │
│ 1    ┆ 6    │ # <-- ?! 0 | 4 is 4, not 6
│ 2    ┆ 6    │ # <-- ?! 0 | 4 is 4, not 6
│ 3    ┆ 6    │
└──────┴──────┘

Why are all flags now 6? I'd expect [6, 4, 4, 6]
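
For reference, the expected values can be checked element by element in plain Python, independent of polars. This is only a sketch of the intended semantics; `set_flag` is a hypothetical helper introduced for illustration, not part of any library.

```python
col1 = [0, 1, 2, 3]
flags = [0, 0, 0, 0]


def set_flag(flags, mask, bit):
    """Return a new list with `bit` OR-ed into each flag whose mask entry is True."""
    return [f | bit if m else f for f, m in zip(flags, mask)]


# Step 1: set b0010 where col1 < 1 or col1 >= 3
flags = set_flag(flags, [(c < 1) or (c >= 3) for c in col1], 2)

# Step 2: set b0100 where col1 > -1 (true for every row here)
flags = set_flag(flags, [c > -1 for c in col1], 4)

print(flags)  # [6, 4, 4, 6]
```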

Doing it the other way around (set flag 4, then flag 2), the result is as expected:

df = pl.DataFrame({"col1": [0,1,2,3], "flag": [0,0,0,0]})
df = df.with_column(
        pl.when(pl.col("col1") > -1)  
        .then(pl.col("flag") | 4)
        .otherwise(pl.col("flag"))
    )
df = df.with_column(
        pl.when((pl.col("col1") < 1) | (pl.col("col1") >= 3))
        .then(pl.col("flag") | 2)
        .otherwise(pl.col("flag"))
    )
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 0    ┆ 6    │
│ 1    ┆ 4    │
│ 2    ┆ 4    │
│ 3    ┆ 6    │
└──────┴──────┘

What's going on here, what am I missing?

Answer 1

Series-wide bitwise operations like OR (|) do not seem to be implemented yet; an issue has been submitted on GitHub.

Work-arounds would be, for example, an apply (rather inefficient):

import polars as pl

df = pl.DataFrame({"col1": [0,1,2,3], "flag": [0,0,0,0]})

df = df.with_columns(
        pl.when((pl.col("col1") < 1) | (pl.col("col1") >= 3))
        .then(pl.col("flag").apply(lambda flag: flag | 2)) # set flag b0010
        .otherwise(pl.col("flag"))
    )
df = df.with_columns(
        pl.when(pl.col("col1") > -1)
        .then(pl.col("flag").apply(lambda flag: flag | 4)) # set/combine with flag b0100
        .otherwise(pl.col("flag"))
    )
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 0    ┆ 6    │
│ 1    ┆ 4    │
│ 2    ┆ 4    │
│ 3    ┆ 6    │
└──────┴──────┘

Or similarly np.bitwise_or (thanks @jqurious):

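import numpy as np

# condition_for_flag: a boolean polars expression, e.g. pl.col("col1") < 1
# flag_to_set: the integer bit to set, e.g. 2
df.with_columns(
        pl.when(condition_for_flag)
        .then(np.bitwise_or(pl.col("flag"), flag_to_set))
        .otherwise(pl.col("flag"))
    )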

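or np.where instead of polars' when-then-otherwise, then convert the result back to a Series:

# here condition_for_flag must be a boolean NumPy array (or Series), not an expression
df.with_columns(
        pl.Series(
            np.where(condition_for_flag,
                     df["flag"].to_numpy() | flag_to_set,
                     df["flag"].to_numpy()
            )
        ).alias("flag")
    )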
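
The two NumPy snippets above use the placeholders condition_for_flag and flag_to_set. As a minimal NumPy-only sketch of the same idea, with the values from the question filled in for illustration (and leaving polars out entirely, which is an assumption made to keep the example self-contained):

```python
import numpy as np

col1 = np.array([0, 1, 2, 3])
flag = np.zeros(4, dtype=np.int64)

# Boolean mask standing in for condition_for_flag
condition_for_flag = (col1 < 1) | (col1 >= 3)
flag_to_set = 2

# np.where takes the OR-ed value where the mask is True, the old value elsewhere
flag = np.where(condition_for_flag, np.bitwise_or(flag, flag_to_set), flag)

# Second flag (b0100), set on every row because col1 > -1 always holds here
flag = np.where(col1 > -1, flag | 4, flag)

print(flag.tolist())  # [6, 4, 4, 6]
```

The resulting array could then be wrapped in pl.Series(...).alias("flag") as shown above.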

Both np.bitwise_or and np.where seem to be more efficient than the apply. While apply most likely has linear time complexity, np.bitwise_or and np.where might perform differently depending on input size. Test for your specific (typical) input size in case of doubt.
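
A rough way to run such a test is a small timeit harness. The sketch below is an assumption-laden stand-in: it compares a Python-level loop (similar in spirit to apply with a lambda, but not polars' apply itself) against np.where on NumPy arrays, and the input size n is arbitrary. Swap in your own data and the actual polars expressions to get meaningful numbers.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
n = 50_000  # assumed input size; use your typical size instead
col1 = rng.integers(0, 4, size=n)
flag = np.zeros(n, dtype=np.int64)
mask = (col1 < 1) | (col1 >= 3)


def per_element():
    # One Python-level OR per row
    return [int(f) | 2 if m else int(f) for f, m in zip(flag, mask)]


def with_where():
    # Vectorized: a single bitwise OR over the whole array
    return np.where(mask, np.bitwise_or(flag, 2), flag)


for fn in (per_element, with_where):
    seconds = timeit.timeit(fn, number=5)
    print(f"{fn.__name__}: {seconds:.4f} s for 5 runs")
```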
