I'm trying to apply / map a NumPy function to the rows of my Polars DataFrame, but I keep running into a type error.
Say I have this code:
import polars as pl
import numpy as np

df = pl.DataFrame(
    {
        "xp": [[1.0, 2.0, 3.0]],
        "fp": [[3.0, 2.0, 0.0]],
    }
)
print(df)
┌─────────────────┬─────────────────┐
│ xp ┆ fp │
│ --- ┆ --- │
│ list[f64] ┆ list[f64] │
╞═════════════════╪═════════════════╡
│ [1.0, 2.0, 3.0] ┆ [3.0, 2.0, 0.0] │
└─────────────────┴─────────────────┘
Now I want to use NumPy's linear interpolation function, np.interp, to interpolate some values, but I run into the same error whether I do this:
df = (
    df
    .with_column(
        pl.struct(["xp", "fp"]).map(
            lambda x: np.interp(
                [2.5, 1.5],
                x.struct.field("xp"),
                x.struct.field("fp"),
            )
        ).alias("interp")
    )
)
Or this:
def f(x):
    return [float(x) for x in np.interp(
        [2.5, 1.5],
        x.struct.field("xp").to_numpy(),
        x.struct.field("fp").to_numpy(),
    )]

df = (
    df
    .with_column(
        pl.struct(["xp", "fp"]).map(f).alias("interp")
    )
)
I get the following error:
exceptions.ComputeError: TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
I have tried using .apply, which gets me the desired result:
df = (
    df
    .with_column(
        pl.struct(["xp", "fp"]).apply(
            lambda x: np.interp(
                [2.5, 1.5],
                x["xp"],
                x["fp"],
            )
        ).alias("interp")
    )
)
┌─────────────────┬─────────────────┬───────────┐
│ xp ┆ fp ┆ interp │
│ --- ┆ --- ┆ --- │
│ list[f64] ┆ list[f64] ┆ object │
╞═════════════════╪═════════════════╪═══════════╡
│ [1.0, 2.0, 3.0] ┆ [3.0, 2.0, 0.0] ┆ [1. 2.5] │
└─────────────────┴─────────────────┴───────────┘
But the new column is of type Object. I tried to convert it to Float64 with .cast(pl.Float64), but I get this error:
exceptions.InvalidOperationError: cannot cast array of type ObjectChunked to arrow datatype
The docs say that the return type is determined by the first non-null value returned by the function, so I don't understand why it does not use a float type.
Let's first expand your data, so that we can more easily see what is happening.
import polars as pl
import numpy as np

df = pl.DataFrame(
    {
        "xp": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
        "fp": [[3.0, 2.0, 0.0], [6.0, 5.0, 3.0]],
    }
)
df
shape: (2, 2)
┌─────────────────┬─────────────────┐
│ xp ┆ fp │
│ --- ┆ --- │
│ list[f64] ┆ list[f64] │
╞═════════════════╪═════════════════╡
│ [1.0, 2.0, 3.0] ┆ [3.0, 2.0, 0.0] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [4.0, 5.0, 6.0] ┆ [6.0, 5.0, 3.0] │
└─────────────────┴─────────────────┘
With our expanded data, let's take a look at what is happening in the function f and find the actual source of the error.
Let's change f to simply print what it sees (and return None):
def f(x):
    print("x is of type:", type(x))
    print(f"{x=}")
    print("x.struct.field('xp') is: ", x.struct.field('xp'))
    print("Before call to numpy")
    [float(x) for x in np.interp(
        [2.5, 1.5],
        x.struct.field("xp").to_numpy(),
        x.struct.field("fp").to_numpy(),
    )]
    print("After call to numpy")
    return None

(
    df
    .with_column(
        pl.struct(["xp", "fp"]).map(f).alias("interp")
    )
)
x is of type: <class 'polars.internals.series.series.Series'>
x=shape: (2,)
Series: 'xp' [struct[2]]
[
{[1.0, 2.0, 3.0],[3.0, 2.0, 0.0]}
{[4.0, 5.0, 6.0],[6.0, 5.0, 3.0]}
]
x.struct.field('xp') is: shape: (2,)
Series: 'xp' [list]
[
[1.0, 2.0, 3.0]
[4.0, 5.0, 6.0]
]
Before call to numpy
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "/home/corey/.virtualenvs/StackOverflow/lib/python3.10/site-packages/polars/internals/dataframe/frame.py", line 4066, in with_column
self.lazy()
File "/home/corey/.virtualenvs/StackOverflow/lib/python3.10/site-packages/polars/internals/lazyframe/frame.py", line 906, in collect
return pli.wrap_df(ldf.collect())
exceptions.ComputeError: TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
Notice that what Polars' map function passes to f (the parameter x) is really a Series of structs. (This is true even though your original DataFrame has only one row.)
As such, x.struct.field('xp') is really a Series of lists. This is not what we want. What we want is to pass the lists in each row to the numpy.interp function separately.
Because numpy.interp receives a whole Series of lists instead of a single list, numpy raises an exception. (Notice that the print("After call to numpy") does not execute.)
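The failure can be reproduced with NumPy alone. The sketch below assumes that a 1-D object array whose elements are lists is a fair stand-in for what to_numpy() returns for a list column; the names xp_column, fp_column and per_row are made up for illustration.

```python
import numpy as np

# Assumption: an object array holding one list per row mimics the
# result of calling to_numpy() on a Series of lists.
xp_column = np.empty(2, dtype=object)
fp_column = np.empty(2, dtype=object)
xp_column[0], xp_column[1] = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
fp_column[0], fp_column[1] = [3.0, 2.0, 0.0], [6.0, 5.0, 3.0]

# Passing the whole column: np.interp needs float64 input and cannot
# safely cast an object array, so it raises a TypeError.
try:
    np.interp([2.5, 1.5], xp_column, fp_column)
    whole_column_error = None
except TypeError as exc:
    whole_column_error = exc
print(whole_column_error)

# Passing one row at a time works, because each row is a plain list of floats.
per_row = [
    np.interp([2.5, 1.5], xp, fp).tolist()
    for xp, fp in zip(xp_column, fp_column)
]
print(per_row)
```

Only the per-row call hands np.interp the 1-D float data it expects.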
What we want to do is iterate over the Series of structs and pass each row separately to numpy.interp.
Here's how we can do that:
def f(series_of_struct: pl.Series) -> pl.Series:
    _values = [
        [
            val
            for val in np.interp(
                [2.5, 1.5],
                next_struct["xp"],
                next_struct["fp"],
            )
        ]
        for next_struct in series_of_struct
    ]
    return pl.Series(values=_values)

(
    df
    .with_column(
        pl.struct(["xp", "fp"]).map(f).alias("interp")
    )
)
shape: (2, 3)
┌─────────────────┬─────────────────┬────────────┐
│ xp ┆ fp ┆ interp │
│ --- ┆ --- ┆ --- │
│ list[f64] ┆ list[f64] ┆ list[f64] │
╞═════════════════╪═════════════════╪════════════╡
│ [1.0, 2.0, 3.0] ┆ [3.0, 2.0, 0.0] ┆ [1.0, 2.5] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [4.0, 5.0, 6.0] ┆ [6.0, 5.0, 3.0] ┆ [6.0, 6.0] │
└─────────────────┴─────────────────┴────────────┘
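The [6.0, 6.0] in the second row may look surprising. It comes from np.interp's default handling of points outside the xp range: both 2.5 and 1.5 lie below xp[0] = 4.0, so they are clamped to fp[0]. A small NumPy-only sketch of this follows; the variable names are illustrative, and the left/right arguments are shown only as one possible way to flag such points.

```python
import numpy as np

x_new = [2.5, 1.5]
xp_row, fp_row = [4.0, 5.0, 6.0], [6.0, 5.0, 3.0]

# Default behaviour: points left of xp[0] are clamped to fp[0].
clamped = np.interp(x_new, xp_row, fp_row).tolist()
print(clamped)  # [6.0, 6.0]

# Optional: mark out-of-range points with NaN instead of clamping.
flagged = np.interp(x_new, xp_row, fp_row, left=np.nan, right=np.nan)
print(flagged)
```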
Most of your question is already answered perfectly by @cbilot, but here is an additional answer concerning the type problem with your apply solution, and some general help for dealing with NumPy functions in Polars.
Many NumPy functions return an ndarray, which Polars can't (currently) handle as a return value from apply. That is why the resulting column has the type Object.
So the solution to your problem is to convert the ndarray into something Polars can handle, such as a list.
Example
import polars as pl
import numpy as np

df = pl.DataFrame(
    {
        "xp": [[1.0, 2.0, 3.0]],
        "fp": [[3.0, 2.0, 0.0]],
    }
)
df = (
    df
    .with_column(
        pl.struct(["xp", "fp"]).apply(
            lambda x: np.interp(
                [2.5, 1.5],
                x["xp"],
                x["fp"]).tolist()
        ).alias("interp")
    )
)
print(df)
┌─────────────────┬─────────────────┬────────────┐
│ xp ┆ fp ┆ interp │
│ --- ┆ --- ┆ --- │
│ list[f64] ┆ list[f64] ┆ list[f64] │
╞═════════════════╪═════════════════╪════════════╡
│ [1.0, 2.0, 3.0] ┆ [3.0, 2.0, 0.0] ┆ [1.0, 2.5] │
└─────────────────┴─────────────────┴────────────┘
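To see what .tolist() changes, it can help to inspect the types outside of Polars. This is a minimal NumPy-only sketch; result and as_list are illustrative names.

```python
import numpy as np

# np.interp returns an ndarray of float64 values.
result = np.interp([2.5, 1.5], [1.0, 2.0, 3.0], [3.0, 2.0, 0.0])
print(type(result).__name__, result.dtype)

# .tolist() converts it to a plain Python list of built-in floats.
as_list = result.tolist()
print(type(as_list).__name__, type(as_list[0]).__name__, as_list)
```

Note that iterating over the ndarray directly (for example with list(result)) yields np.float64 scalars rather than built-in floats, whereas .tolist() converts the elements as well.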