
How to properly set binary flags in a Python polars dataframe

When implementing a binary flag column in Python polars (v0.15.15), I came across some seemingly weird behavior. Given this DataFrame:

import polars as pl

df = pl.DataFrame({
        "col1": [0,1,2,3],
        "flag": [0,0,0,0]
    })

I set a flag by OR-ing the current flag value with a bit mask, e.g. 2:

df = df.with_column(
        pl.when((pl.col("col1") < 1) | (pl.col("col1") >= 3))
        .then(pl.col("flag") | 2) # set flag b0010
        .otherwise(pl.col("flag"))
    )
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 0    ┆ 2    │
│ 1    ┆ 0    │
│ 2    ┆ 0    │
│ 3    ┆ 2    │
└──────┴──────┘

So far so good. However, when adding another flag, I get something unexpected:

df = df.with_column(
        pl.when(pl.col("col1") > -1)  
        .then(pl.col("flag") | 4) # also set flag b0100
        .otherwise(pl.col("flag"))
    )
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 0    ┆ 6    │
│ 1    ┆ 6    │ # <-- ?! 0 | 4 is 4, not 6
│ 2    ┆ 6    │ # <-- ?! 0 | 4 is 4, not 6
│ 3    ┆ 6    │
└──────┴──────┘

Why are all flags now 6? I'd expect [6, 4, 4, 6]
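For reference, the expected result can be checked without polars. This is a minimal plain-Python sketch that applies the same two conditions and bit masks to ordinary lists. It only illustrates the intended semantics and says nothing about how polars evaluates the expression.

```python
col1 = [0, 1, 2, 3]
flags = [0, 0, 0, 0]

# Step 1: set bit b0010 where col1 < 1 or col1 >= 3
flags = [(f | 2) if (c < 1 or c >= 3) else f for c, f in zip(col1, flags)]

# Step 2: set bit b0100 where col1 > -1 (true for every row here)
flags = [(f | 4) if c > -1 else f for c, f in zip(col1, flags)]

print(flags)  # [6, 4, 4, 6]
```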

Doing it the other way around (set flag 4, then flag 2), the result is as expected:

df = pl.DataFrame({"col1": [0,1,2,3], "flag": [0,0,0,0]})
df = df.with_column(
        pl.when(pl.col("col1") > -1)  
        .then(pl.col("flag") | 4)
        .otherwise(pl.col("flag"))
    )
df = df.with_column(
        pl.when((pl.col("col1") < 1) | (pl.col("col1") >= 3))
        .then(pl.col("flag") | 2)
        .otherwise(pl.col("flag"))
    )
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 0    ┆ 6    │
│ 1    ┆ 4    │
│ 2    ┆ 4    │
│ 3    ┆ 6    │
└──────┴──────┘

What's going on here? What am I missing?

Series-wide bitwise operations such as OR (|) do not seem to be implemented yet; an issue has been submitted on GitHub.

One workaround is an apply (rather inefficient):

import polars as pl

df = pl.DataFrame({"col1": [0,1,2,3], "flag": [0,0,0,0]})

df = df.with_columns(
        pl.when((pl.col("col1") < 1) | (pl.col("col1") >= 3))
        .then(pl.col("flag").apply(lambda flag: flag | 2)) # set flag b0010
        .otherwise(pl.col("flag"))
    )
df = df.with_columns(
        pl.when(pl.col("col1") > -1)
        .then(pl.col("flag").apply(lambda flag: flag | 4)) # set/combine with flag b0100
        .otherwise(pl.col("flag"))
    )
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 0    ┆ 6    │
│ 1    ┆ 4    │
│ 2    ┆ 4    │
│ 3    ┆ 6    │
└──────┴──────┘

Or similarly np.bitwise_or (thanks @jqurious):

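import numpy as np

# condition_for_flag: a boolean polars expression; flag_to_set: an integer bit mask
df.with_columns(
        pl.when(condition_for_flag)
        .then(np.bitwise_or(pl.col("flag"), flag_to_set))
        .otherwise(pl.col("flag"))
    )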

Or use np.where instead of polars' when/then/otherwise, then convert the result back to a Series:

# here condition_for_flag must be a boolean array, not a polars expression
df.with_columns(
        pl.Series(
            np.where(condition_for_flag,
                     df["flag"].to_numpy() | flag_to_set,
                     df["flag"].to_numpy()
            )
        ).alias("flag")
    )
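The np.where logic can also be tried on its own, outside of a DataFrame. The following is a small sketch using plain NumPy arrays that stand in for the "col1" and "flag" columns of the example; the array names are only illustrative.

```python
import numpy as np

# Plain arrays standing in for the "col1" and "flag" columns
col1 = np.array([0, 1, 2, 3])
flag = np.zeros(4, dtype=np.int64)

# Set bit b0010 where col1 < 1 or col1 >= 3
flag = np.where((col1 < 1) | (col1 >= 3), flag | 2, flag)

# Set bit b0100 where col1 > -1 (every row), keeping bits already set
flag = np.where(col1 > -1, flag | 4, flag)

print(flag.tolist())  # [6, 4, 4, 6]
```

The resulting array could then be wrapped in pl.Series(...) as shown above.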

Both np.bitwise_or and np.where seem to be more efficient than apply. While apply most likely has linear time complexity, np.bitwise_or and np.where might perform differently depending on input size. When in doubt, benchmark with your specific (typical) input size.
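A rough way to run such a test is sketched below with timeit. It is only a stand-in: it compares a per-element Python loop (similar in spirit to apply with a lambda) against np.bitwise_or on a NumPy array and does not measure polars itself. The function names and the input size of 100,000 are assumptions to adjust for your own data.

```python
import timeit

import numpy as np

flags = np.zeros(100_000, dtype=np.int64)  # hypothetical input size
mask = 4


def elementwise(values, mask):
    # One Python-level operation per element, comparable in spirit to apply
    return [int(v) | mask for v in values]


def vectorized(values, mask):
    # A single vectorized call over the whole array
    return np.bitwise_or(values, mask)


# Both approaches must agree before their timings are worth comparing
assert elementwise(flags, mask) == vectorized(flags, mask).tolist()

runs = 5
for fn in (elementwise, vectorized):
    seconds = timeit.timeit(lambda: fn(flags, mask), number=runs)
    print(f"{fn.__name__}: {seconds / runs:.6f} s per run")
```

To time the actual polars expressions, replace the two function bodies with the corresponding with_columns calls.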


 