Due to memory limitations I have to use sparse columns in a pandas.DataFrame
(pandas version 1.0.5). Unfortunately, with index-based access to rows (using .loc[]), I am running into the following issue:
import pandas as pd
import scipy.sparse

df = pd.DataFrame.sparse.from_spmatrix(
    scipy.sparse.csr_matrix([[0, 0, 0, 1],
                             [1, 0, 0, 0],
                             [0, 1, 0, 0]])
)
df
Output:
   0  1  2  3
0  0  0  0  1
1  1  0  0  0
2  0  1  0  0
If using .loc:
df.loc[[0,1]]
Output:
   0  1    2  3
0  0  0  NaN  1
1  1  0  NaN  0
Ideally, I would expect 0s for column 2 as well. My hypothesis is that the internal CSC-matrix representation, combined with the fact that I am accessing rows of a column that originally contains no non-zero values, messes with the fill value. The dtypes, however, sort of speak against this:
df.loc[[0,1]].dtypes
Output:
0      Sparse[int32, 0]
1      Sparse[int32, 0]
2    Sparse[float64, 0]
3      Sparse[int32, 0]
(Note that the fill value is still given as 0, even though the view's dtype for column 2 has changed from Sparse[int32, 0] to Sparse[float64, 0].)
Can anyone tell me whether all NaNs occurring in a row-sliced pd.DataFrame with sparse columns indeed refer to the respective zero value and will not "hide" any actual non-zero entries? Is there a "safe" way to use index-based row access on pd.DataFrames with sparse columns?
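One possible way to sidestep the question, sketched here under the assumption that the original scipy matrix is still available and that the row labels are plain positions, is to slice the rows in scipy first and only then wrap the result in a sparse DataFrame. The names mat, rows and subset are illustrative and not part of the original example.

```python
import pandas as pd
import scipy.sparse

# Same data as in the minimal example above.
mat = scipy.sparse.csr_matrix([[0, 0, 0, 1],
                               [1, 0, 0, 0],
                               [0, 1, 0, 0]])

rows = [0, 1]  # positional row indices to keep (assumed to match the labels)

# Slice the CSR matrix by row, then build the sparse DataFrame from the slice,
# so that .loc is never applied to the sparse columns.
subset = pd.DataFrame.sparse.from_spmatrix(mat[rows], index=rows)

# Densify only the (small) slice to inspect the values.
dense = subset.sparse.to_dense()
print(dense)
```

This keeps the full data sparse and densifies only the selected rows, at the cost of having to keep the scipy matrix around alongside the DataFrame.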
So this indeed turned out to be a bug in pandas that has been fixed in version 1.1.0 (see GitHub for an issue description and the changelog for 1.1.0).
In 1.1.0 the minimal example works:
df = pd.DataFrame.sparse.from_spmatrix(
    scipy.sparse.csr_matrix([[0, 0, 0, 1],
                             [1, 0, 0, 0],
                             [0, 1, 0, 0]])
)
df.loc[[0, 1]]
Output:
   0  1  2  3
0  0  0  0  1
1  1  0  0  0
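To check a given installation, a small sanity test along the following lines may help. It is only a sketch: the version parsing is deliberately naive, and the names fixed, has_nan and matches are illustrative. It compares the row-sliced sparse frame against the same slice taken from a dense copy.

```python
import pandas as pd
import scipy.sparse

# Naive version check: only looks at the major and minor components.
major, minor = (int(part) for part in pd.__version__.split(".")[:2])
fixed = (major, minor) >= (1, 1)  # the fix is reported for 1.1.0

df = pd.DataFrame.sparse.from_spmatrix(
    scipy.sparse.csr_matrix([[0, 0, 0, 1],
                             [1, 0, 0, 0],
                             [0, 1, 0, 0]])
)

# Slice the sparse frame, then densify the result for comparison.
sliced = df.loc[[0, 1]].sparse.to_dense()

# Reference: densify first, then slice (no sparse row access involved).
expected = df.sparse.to_dense().loc[[0, 1]]

has_nan = bool(sliced.isna().any().any())
matches = sliced.to_numpy().tolist() == expected.to_numpy().tolist()

print("pandas", pd.__version__, "| NaNs in slice:", has_nan, "| matches dense:", matches)
```

Densifying the whole frame for the reference is only reasonable on a small test case like this one, not on data that had to be sparse for memory reasons.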