Index-based access to rows in pandas.DataFrame with Sparse columns

Question

Due to memory limitations I have to use sparse columns in a pandas.DataFrame (pandas version 1.0.5). Unfortunately, with index-based access to rows (using .loc[]), I am running into the following issue:

import pandas as pd
import scipy.sparse

df = pd.DataFrame.sparse.from_spmatrix(
    scipy.sparse.csr_matrix([[0, 0, 0, 1],
                             [1, 0, 0, 0],
                             [0, 1, 0, 0]])
)

df

Output:

    0   1   2   3
0   0   0   0   1
1   1   0   0   0
2   0   1   0   0

When using .loc:

df.loc[[0,1]]

Output:

    0   1   2       3
0   0   0   NaN     1
1   1   0   NaN     0

Ideally, I would expect 0s in column 2 as well. My hypothesis is that the internal CSC matrix representation, combined with the fact that I am accessing rows of a column that contains no non-zero values, messes with the fill value. The dtypes somewhat speak against this:

df.loc[[0,1]].dtypes

Output:

0         Sparse[int32, 0]
1         Sparse[int32, 0]
2       Sparse[float64, 0]
3         Sparse[int32, 0]

(Note that the fill value is still given as 0, even though the dtype of column 2 in the sliced result has changed from Sparse[int32, 0] to Sparse[float64, 0].)

Can anyone tell me whether all NaNs occurring in a row-sliced pd.DataFrame with sparse columns indeed refer to the respective fill value of zero and will not "hide" any actual non-zero entries? Is there a "safe" way to use index-based row access on pd.DataFrames with sparse columns?

Answer 1

This turned out to be a bug in pandas that was fixed in version 1.1.0 (see the issue description on GitHub and the changelog for 1.1.0).

In 1.1.0 the minimal example works:

df = pd.DataFrame.sparse.from_spmatrix(
    scipy.sparse.csr_matrix([[0, 0, 0, 1],
                             [1, 0, 0, 0],
                             [0, 1, 0, 0]])
)
df.loc[[0, 1]]

Output:

    0   1   2   3
0   0   0   0   1
1   1   0   0   0
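
If upgrading is not an option, one possible workaround is to do the row selection in scipy and rebuild the sparse frame afterwards. This is only a sketch: the helper name sparse_loc_rows is hypothetical (it is not part of pandas), and it assumes that every column is sparse with fill value 0 and that the index labels are unique.

```python
import pandas as pd
import scipy.sparse


def sparse_loc_rows(df, labels):
    """Hypothetical helper: label-based row selection that bypasses .loc.

    Assumes all columns are sparse with fill value 0 and a unique index.
    """
    # Translate labels to integer positions; -1 marks a missing label.
    positions = df.index.get_indexer(labels)
    if (positions < 0).any():
        raise KeyError("some labels are not in the index")
    # Do the row selection on a scipy CSR matrix instead of in pandas.
    csr = df.sparse.to_coo().tocsr()
    return pd.DataFrame.sparse.from_spmatrix(
        csr[positions], index=df.index[positions], columns=df.columns
    )


df = pd.DataFrame.sparse.from_spmatrix(
    scipy.sparse.csr_matrix([[0, 0, 0, 1],
                             [1, 0, 0, 0],
                             [0, 1, 0, 0]])
)
subset = sparse_loc_rows(df, [0, 1])
print(subset)
print(subset.dtypes)
```

Because the selection happens on the scipy side, the result never goes through the row-slicing code path in pandas, and the data stays sparse throughout.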

solution1
1 ACCEPTED 2020-08-03 09:14:41
