简体   繁体   中英

Efficiently Remove all Zeroes from Pandas Dataframe

train_subset.iloc[:10, :10]
1  0  7045    17    16.0    0.0    0.0   0.0   0.0  18.0
1  0  8907   137    48.0   42.0   53.0  32.0  35.0  30.0
1  0   917   167   193.0   88.0   48.0   0.0  24.0   0.0
1  0   203  5303  2073.0  153.0  108.0  97.0  73.0  89.0
1  0   198   817   198.0    0.0    0.0  88.0  93.0  70.0

I have a dataset called train. I've printed out a small subset of it here.

Is there an efficient way to just REMOVE all zeroes from the dataset, updating the columns by shifting to the left? For example, my desired output would be...

Please note, I want this to only trigger starting on the 4th column. The first, second, and third column are meta-data and should not be changed.

train_subset.iloc[:10, :10]
1  0  7045    17    16.0   18.0
1  0  8907   137    48.0   42.0   53.0  32.0  35.0  30.0
1  0   917   167   193.0   88.0   48.0  24.0 
1  0   203  5303  2073.0  153.0  108.0  97.0  73.0  89.0
1  0   198   817   198.0   88.0   93.0  70.0

Essentially, you can think of each row as a time-series, and I want to disregard any zeroes. So I only have readings at timepoints where events occur, and I've removed all zeroes between.

I can figure out how to remove zeroes and replace with NA's or whatever, but I cannot figure out how to forcibly shift each row to the left.

np.argsort + boolean masking

s = df.iloc[:, 3:]
i = np.argsort(s.eq(0), axis=1)
df.iloc[:, 3:] = np.take_along_axis(s[s.ne(0)].values, i.values, axis=1)

Explanations

Slice the portion of given dataframe starting from the fourth column onward then compare it with 0 to create a boolean mask. The idea behind creating the boolean mask is that when we sort the mask in the next step, then the False values will come first while also maintaining the relative order which essentially provide a shift like behaviour.

>>> s.eq(0)

       3      4      5      6      7      8      9
0  False  False   True   True   True   True  False
1  False  False  False  False  False  False  False
2  False  False  False  False   True  False   True
3  False  False  False  False  False  False  False
4  False  False   True   True  False  False  False

Now use np.argsort on the above mask to get the indices that would sort the values in the slice of dataframe s .

>>> np.argsort(s.eq(0), axis=1)

   3  4  5  6  7  8  9
0  0  1  6  2  3  4  5
1  0  1  2  3  4  5  6
2  0  1  2  3  5  4  6
3  0  1  2  3  4  5  6
4  0  1  4  5  6  2  3

Then use np.take_along_axis with the above indices to sort the values in s , and assign these sorted values in the given dataframe df .

>>> np.take_along_axis(s[s.ne(0)].values, i.values, axis=1)

array([[  17.,   16.,   18.,   nan,   nan,   nan,   nan],
       [ 137.,   48.,   42.,   53.,   32.,   35.,   30.],
       [ 167.,  193.,   88.,   48.,   24.,   nan,   nan],
       [5303., 2073.,  153.,  108.,   97.,   73.,   89.],
       [ 817.,  198.,   88.,   93.,   70.,   nan,   nan]])

Result

>>> df

   0  1     2       3       4      5      6     7     8     9
0  1  0  7045    17.0    16.0   18.0    NaN   NaN   NaN   NaN
1  1  0  8907   137.0    48.0   42.0   53.0  32.0  35.0  30.0
2  1  0   917   167.0   193.0   88.0   48.0  24.0   NaN   NaN
3  1  0   203  5303.0  2073.0  153.0  108.0  97.0  73.0  89.0
4  1  0   198   817.0   198.0   88.0   93.0  70.0   NaN   NaN

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM