
Map numpy function to column in PyPolars - specify type

I'm trying to apply / map a Numpy function to rows of my PyPolars DataFrame, but I keep running into a type issue.

Say I have this code:

import polars as pl
import numpy as np

df = pl.DataFrame(
    {
        "xp": [[1.0, 2.0, 3.0]],
        "fp": [[3.0, 2.0, 0.0]],
    }
)
print(df)
┌─────────────────┬─────────────────┐
│ xp              ┆ fp              │
│ ---             ┆ ---             │
│ list[f64]       ┆ list[f64]       │
╞═════════════════╪═════════════════╡
│ [1.0, 2.0, 3.0] ┆ [3.0, 2.0, 0.0] │
└─────────────────┴─────────────────┘

and now I want to use the Numpy linear interpolation function to interpolate some values, but I run into the same error whether I do this:

df = (
    df
    .with_column(
        pl.struct(["xp", "fp"]).map(
            lambda x: np.interp(
                [2.5, 1.5],
                x.struct.field("xp"),
                x.struct.field("fp"),
            )
        ).alias("interp")
    )
)

Or this:

def f(x):
    return [float(x) for x in np.interp(
        [2.5, 1.5],
        x.struct.field("xp").to_numpy(),
        x.struct.field("fp").to_numpy(),
    )]

df = (
    df
    .with_column(
        pl.struct(["xp", "fp"]).map(f).alias("interp")
    )
)

I get the following error:

exceptions.ComputeError: TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

I have tried using .apply which gets me the desired result:

df = (
    df
    .with_column(
        pl.struct(["xp", "fp"]).apply(
            lambda x: np.interp(
                [2.5, 1.5],
                x["xp"],
                x["fp"],
            )
        ).alias("interp")
    )
)
┌─────────────────┬─────────────────┬───────────┐
│ xp              ┆ fp              ┆ interp    │
│ ---             ┆ ---             ┆ ---       │
│ list[f64]       ┆ list[f64]       ┆ object    │
╞═════════════════╪═════════════════╪═══════════╡
│ [1.0, 2.0, 3.0] ┆ [3.0, 2.0, 0.0] ┆ [1.  2.5] │
└─────────────────┴─────────────────┴───────────┘

But the resulting column is of type Object. I have tried to convert it into Float64 with .cast(pl.Float64), but I get this error:

exceptions.InvalidOperationError: cannot cast array of type ObjectChunked to arrow datatype

The docs say that the return type is determined by the first non-null value returned by the function, so I don't understand why it does not use the Float type.

Let's first expand your data, so that we can more easily see what is happening.

import polars as pl
import numpy as np

df = pl.DataFrame(
    {
        "xp": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
        "fp": [[3.0, 2.0, 0.0], [6.0, 5.0, 3.0]],
    }
)
df
shape: (2, 2)
┌─────────────────┬─────────────────┐
│ xp              ┆ fp              │
│ ---             ┆ ---             │
│ list[f64]       ┆ list[f64]       │
╞═════════════════╪═════════════════╡
│ [1.0, 2.0, 3.0] ┆ [3.0, 2.0, 0.0] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [4.0, 5.0, 6.0] ┆ [6.0, 5.0, 3.0] │
└─────────────────┴─────────────────┘

The source of the error

With our expanded data, let's take a look at what is happening in the function f and find the actual source of the error.

Let's change f to simply print what it sees (and return None):

def f(x):
    print("x is of type:", type(x))
    print(f"{x=}")
    print("x.struct.field('xp') is: ", x.struct.field('xp'))

    print("Before call to numpy")
    [float(x) for x in np.interp(
        [2.5, 1.5],
        x.struct.field("xp").to_numpy(),
        x.struct.field("fp").to_numpy(),
    )]
    print("After call to numpy")
    return None


(
    df
    .with_column(
        pl.struct(["xp", "fp"]).map(f).alias("interp")
    )
)
x is of type: <class 'polars.internals.series.series.Series'>

x=shape: (2,)
Series: 'xp' [struct[2]]
[
        {[1.0, 2.0, 3.0],[3.0, 2.0, 0.0]}
        {[4.0, 5.0, 6.0],[6.0, 5.0, 3.0]}
]

x.struct.field('xp') is:  shape: (2,)
Series: 'xp' [list]
[
        [1.0, 2.0, 3.0]
        [4.0, 5.0, 6.0]
]

Before call to numpy

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/home/corey/.virtualenvs/StackOverflow/lib/python3.10/site-packages/polars/internals/dataframe/frame.py", line 4066, in with_column
    self.lazy()
  File "/home/corey/.virtualenvs/StackOverflow/lib/python3.10/site-packages/polars/internals/lazyframe/frame.py", line 906, in collect
    return pli.wrap_df(ldf.collect())
exceptions.ComputeError: TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

Notice that what Polars' map function passes to f (the parameter x) is really a Series of structs. (This is true even though your original DataFrame has only one row.)

As such, x.struct.field('xp') is really a Series of lists. This is not what we want. What we want is to pass the lists in each row to the numpy.interp function separately.

As a result, numpy raises an exception. (Notice that the print("After call to numpy") does not execute.)
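
The failure can be reproduced without Polars at all. Here is a minimal sketch, assuming that converting a Series of lists to numpy yields an object-dtype array holding one whole list per row. The arrays below are built by hand to mimic that, so they are stand-ins rather than real Polars output:

```python
import numpy as np

# Hand-built stand-ins (an assumption) for what .to_numpy() gives on a
# Series of lists: object arrays where each element is one row's list.
xp = np.empty(2, dtype=object)
fp = np.empty(2, dtype=object)
xp[0], xp[1] = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
fp[0], fp[1] = [3.0, 2.0, 0.0], [6.0, 5.0, 3.0]

try:
    # All rows are handed to np.interp at once, as in the failing f above.
    np.interp([2.5, 1.5], xp, fp)
    error_message = None
except TypeError as exc:
    # numpy refuses to cast the object array to float64.
    error_message = str(exc)

print(error_message)
```

np.interp expects one flat sequence of floats for each of xp and fp, so handing it a whole column of lists fails before any interpolation happens.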

Iterating over the Series

What we want to do is iterate over the Series of structs and pass each row separately to numpy.interp.

Here's how we can do that:

def f(series_of_struct: pl.Series) -> pl.Series:
    _values = [
        [
            val
            for val in np.interp(
                [2.5, 1.5],
                next_struct["xp"],
                next_struct["fp"],
            )
        ]
        for next_struct in series_of_struct
    ]

    return pl.Series(values=_values)


(
    df
    .with_column(
        pl.struct(["xp", "fp"]).map(f).alias("interp")
    )
)

shape: (2, 3)
┌─────────────────┬─────────────────┬────────────┐
│ xp              ┆ fp              ┆ interp     │
│ ---             ┆ ---             ┆ ---        │
│ list[f64]       ┆ list[f64]       ┆ list[f64]  │
╞═════════════════╪═════════════════╪════════════╡
│ [1.0, 2.0, 3.0] ┆ [3.0, 2.0, 0.0] ┆ [1.0, 2.5] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [4.0, 5.0, 6.0] ┆ [6.0, 5.0, 3.0] ┆ [6.0, 6.0] │
└─────────────────┴─────────────────┴────────────┘
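
The same row-by-row idea can be checked without a DataFrame. This is a minimal sketch, assuming each element yielded when iterating over the struct Series behaves like a dict keyed by field name (as the next_struct["xp"] lookups above suggest). The names interp_rows and rows are hypothetical, used only for illustration:

```python
import numpy as np


def interp_rows(rows, x_new):
    """Interpolate x_new against each row's own xp/fp lists."""
    return [
        # .tolist() turns the ndarray result into plain Python floats.
        np.interp(x_new, row["xp"], row["fp"]).tolist()
        for row in rows
    ]


# Plain dicts standing in for the structs of the expanded DataFrame.
rows = [
    {"xp": [1.0, 2.0, 3.0], "fp": [3.0, 2.0, 0.0]},
    {"xp": [4.0, 5.0, 6.0], "fp": [6.0, 5.0, 3.0]},
]

result = interp_rows(rows, [2.5, 1.5])
print(result)  # [[1.0, 2.5], [6.0, 6.0]]
```

The second row comes out as [6.0, 6.0] because both 2.5 and 1.5 lie to the left of that row's smallest xp value (4.0), and by default np.interp returns the first fp value for points below the range.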

Most of your question is already answered perfectly by @cbilot, but here is an additional answer concerning the type problem with your apply solution, along with some general help for dealing with numpy functions in polars.

Many numpy functions return an ndarray, which polars can't (currently) handle. That is why the column has the type Object.

So the solution to your problem is to convert the ndarray into something polars can handle, such as a list.

Example

import polars as pl
import numpy as np

df = pl.DataFrame(
    {
        "xp": [[1.0, 2.0, 3.0]],
        "fp": [[3.0, 2.0, 0.0]],
    }
)

df = (
    df
    .with_column(
        pl.struct(["xp", "fp"]).apply(
            lambda x: np.interp(
                [2.5, 1.5],
                x["xp"],
                x["fp"],
            ).tolist()
        ).alias("interp")
    )
)

print(df)

┌─────────────────┬─────────────────┬────────────┐
│ xp              ┆ fp              ┆ interp     │
│ ---             ┆ ---             ┆ ---        │
│ list[f64]       ┆ list[f64]       ┆ list[f64]  │
╞═════════════════╪═════════════════╪════════════╡
│ [1.0, 2.0, 3.0] ┆ [3.0, 2.0, 0.0] ┆ [1.0, 2.5] │
└─────────────────┴─────────────────┴────────────┘
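
To see what .tolist() changes, it helps to look at the types on the numpy side alone. A small sketch using the same numbers as the example above (the variable names raw and as_list are just for illustration):

```python
import numpy as np

# np.interp returns an ndarray, which ends up as an Object column.
raw = np.interp([2.5, 1.5], [1.0, 2.0, 3.0], [3.0, 2.0, 0.0])

# .tolist() converts it to a plain list of Python floats.
as_list = raw.tolist()

print(type(raw).__name__)         # ndarray
print(as_list)                    # [1.0, 2.5]
print(type(as_list[0]).__name__)  # float
```

A list of Python floats is something polars can map onto list[f64], which matches the dtype shown in the output above.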
