I think my problem is easy to understand, but I don't know how to solve it efficiently without loops.
My dataset (already sorted by ID and Value) has IDs, some features, and an integer Value column. My goal is, for each ID, to keep only the first run of consecutive values (each value one greater than the previous), starting from that ID's first row; if an ID has only one row, keep that row.
I think it is easier to understand with an example, so let me show you. My dataset looks like this:
import pandas as pd
d = {'Id': [1, 1, 1, 1, 2, 3, 3, 3], 'Feature': ['F1', 'F1', 'F1', 'F1', 'F2', 'F3', 'F3', 'F3'], 'Value': [1, 2, 4, 5, 2, 15, 16, 18]}
df = pd.DataFrame(data=d)
Id Feature Value
0 1 F1 1
1 1 F1 2
2 1 F1 4
3 1 F1 5
4 2 F2 2
5 3 F3 15
6 3 F3 16
7 3 F3 18
Note: Duplicates are already dropped. Note 2: The feature is always the same within an ID, and different IDs can share the same feature.
My goal would be to get this returned:
Id Feature Value
0 1 F1 1
1 1 F1 2
4 2 F2 2
5 3 F3 15
6 3 F3 16
PS: Sorry in advance for any grammar mistakes; English is not my first language.
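Use diff within each Id group to get the difference from the previous Value. The first row of each group has no previous value, so its difference is NaN, which also counts as not equal to 1. Compare the differences for not equal to 1, take the cumulative sum with Series.cumsum so that each run of consecutive values gets a number (1 for the first run), compare that number with 1, and filter with boolean indexing: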
df = df[df.groupby('Id')['Value'].apply(lambda x: x.diff().ne(1).cumsum()).eq(1)]
print (df)
Id Feature Value
0 1 F1 1
1 1 F1 2
4 2 F2 2
5 3 F3 15
6 3 F3 16
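To make the individual steps visible, here is a minimal sketch of the same idea written without `apply`. It is a variant, not the answer's exact code: the intermediate names `step`, `new_run`, `run_no` and `result` are made up for illustration. Computing the cumulative sum through a second `groupby` keeps the original row index, which may matter on recent pandas versions where `groupby(...).apply(...)` can add the group keys to the result index.

```python
import pandas as pd

d = {'Id': [1, 1, 1, 1, 2, 3, 3, 3],
     'Feature': ['F1', 'F1', 'F1', 'F1', 'F2', 'F3', 'F3', 'F3'],
     'Value': [1, 2, 4, 5, 2, 15, 16, 18]}
df = pd.DataFrame(data=d)

# Difference to the previous Value inside each Id; the first row of each Id is NaN.
step = df.groupby('Id')['Value'].diff()

# 1 where a new run starts (NaN != 1 is True, so every Id starts with a run), else 0.
new_run = step.ne(1).astype(int)

# Running count of runs within each Id: 1 for the first run, 2 for the second, ...
run_no = new_run.groupby(df['Id']).cumsum()

# Keep only the rows that belong to the first run of each Id.
result = df[run_no.eq(1)]
print(result)
```

Because `run_no` shares the index of `df`, the boolean mask aligns row by row, and an Id with a single row is kept automatically since that row always starts run number 1.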