train_subset.iloc[:10, :10]
1 0 7045 17 16.0 0.0 0.0 0.0 0.0 18.0
1 0 8907 137 48.0 42.0 53.0 32.0 35.0 30.0
1 0 917 167 193.0 88.0 48.0 0.0 24.0 0.0
1 0 203 5303 2073.0 153.0 108.0 97.0 73.0 89.0
1 0 198 817 198.0 0.0 0.0 88.0 93.0 70.0
I have a dataset called train. I've printed out a small subset of it here.
Is there an efficient way to just REMOVE all zeroes from the dataset, updating the columns by shifting to the left? For example, my desired output would be...
Please note, I want this to only trigger starting on the 4th column. The first, second, and third column are meta-data and should not be changed.
train_subset.iloc[:10, :10]
1 0 7045 17 16.0 18.0
1 0 8907 137 48.0 42.0 53.0 32.0 35.0 30.0
1 0 917 167 193.0 88.0 48.0 24.0
1 0 203 5303 2073.0 153.0 108.0 97.0 73.0 89.0
1 0 198 817 198.0 88.0 93.0 70.0
Essentially, you can think of each row as a time-series, and I want to disregard any zeroes. So I only have readings at timepoints where events occur, and I've removed all zeroes between.
I can figure out how to remove zeroes and replace with NA's or whatever, but I cannot figure out how to forcibly shift each row to the left.
np.argsort
+ boolean masking
s = df.iloc[:, 3:]
i = np.argsort(s.eq(0), axis=1)
df.iloc[:, 3:] = np.take_along_axis(s[s.ne(0)].values, i.values, axis=1)
Slice the portion of given dataframe starting from the fourth
column onward then compare it with 0
to create a boolean mask. The idea behind creating the boolean mask is that when we sort the mask in the next step, then the False
values will come first while also maintaining the relative order which essentially provide a shift
like behaviour.
>>> s.eq(0)
3 4 5 6 7 8 9
0 False False True True True True False
1 False False False False False False False
2 False False False False True False True
3 False False False False False False False
4 False False True True False False False
Now use np.argsort
on the above mask to get the indices that would sort
the values in the slice of dataframe s
.
>>> np.argsort(s.eq(0), axis=1)
3 4 5 6 7 8 9
0 0 1 6 2 3 4 5
1 0 1 2 3 4 5 6
2 0 1 2 3 5 4 6
3 0 1 2 3 4 5 6
4 0 1 4 5 6 2 3
Then use np.take_along_axis
with the above indices to sort the values in s
, and assign these sorted values in the given dataframe df
.
>>> np.take_along_axis(s[s.ne(0)].values, i.values, axis=1)
array([[ 17., 16., 18., nan, nan, nan, nan],
[ 137., 48., 42., 53., 32., 35., 30.],
[ 167., 193., 88., 48., 24., nan, nan],
[5303., 2073., 153., 108., 97., 73., 89.],
[ 817., 198., 88., 93., 70., nan, nan]])
>>> df
0 1 2 3 4 5 6 7 8 9
0 1 0 7045 17.0 16.0 18.0 NaN NaN NaN NaN
1 1 0 8907 137.0 48.0 42.0 53.0 32.0 35.0 30.0
2 1 0 917 167.0 193.0 88.0 48.0 24.0 NaN NaN
3 1 0 203 5303.0 2073.0 153.0 108.0 97.0 73.0 89.0
4 1 0 198 817.0 198.0 88.0 93.0 70.0 NaN NaN
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.