I have a dataframe in which each row holds a sequence of (code, day) tuples, e.g. (A,1) means coded value A was recorded on day 1. My goal is to check for the coded values X and Y and, if they occur on the same day, remove the Y value from the sequence.
ID Sequence
1 [(A,1), (B,1), (X,2), (Y,2), (Y,3)]
2 [(C,1), (X,2), (Y,2), (Z,2)]
3 [(C,1), (D,2), (X,3), (Y,3), (Z,3)]
The results I'm expecting are:
ID Sequence
1 [(A,1), (B,1), (X,2), (Y,3)]
2 [(C,1), (X,2), (Z,2)]
3 [(C,1), (D,2), (X,3), (Z,3)]
Is there any way I can write a function to get these results? Any help would be appreciated.
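You can use a set to keep track of days (set membership checks are fast for this kind of use case). For each tuple, if the code at index 0 is X or Y, look at the day at index 1 (the second item): if that day is already in the set, the tuple is not appended; otherwise the day is added to the set and the tuple is kept. All other tuples are kept unchanged. This relies on X appearing before Y within a day, as it does in the sample data. Then use this function with df.apply: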
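def fun(l):
    s = set()  # days on which an X or Y has already been kept
    lst = []
    for i in l:
        if i[0] in ('X', 'Y'):
            # keep only the first X/Y seen on each day
            if i[1] not in s:
                s.add(i[1])
                lst.append(i)
        else:
            # every other code is kept unchanged
            lst.append(i)
    return lst

df['Sequence'].apply(fun)  # to assign back: df['Sequence'] = df['Sequence'].apply(fun)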
0 [(A, 1), (B, 1), (X, 2), (Y, 3)]
1 [(C, 1), (X, 2), (Z, 2)]
2 [(C, 1), (D, 2), (X, 3), (Z, 3)]
Name: Sequence, dtype: object
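If the order of X and Y within a day is not guaranteed, a two-pass variant may be safer. The following is only a sketch, not part of the answer above: the name drop_y_on_x_days is hypothetical, and it assumes each sequence is a list of (code, day) tuples with string codes. It first collects the days that contain an X, then drops only the Y entries recorded on those days.

```python
def drop_y_on_x_days(seq):
    """Return seq without ('Y', day) entries whose day also has an 'X' entry."""
    # First pass: collect every day on which X was recorded.
    x_days = {day for code, day in seq if code == 'X'}
    # Second pass: keep everything except Y entries that fall on one of those days.
    return [(code, day) for code, day in seq
            if not (code == 'Y' and day in x_days)]


# Sample rows from the question, written as plain Python lists (assumed layout).
rows = [
    [('A', 1), ('B', 1), ('X', 2), ('Y', 2), ('Y', 3)],
    [('C', 1), ('X', 2), ('Y', 2), ('Z', 2)],
    [('C', 1), ('D', 2), ('X', 3), ('Y', 3), ('Z', 3)],
]
for row in rows:
    print(drop_y_on_x_days(row))
```

It can be applied in the same way, e.g. df['Sequence'].apply(drop_y_on_x_days). Unlike the set-based loop, it keeps a Y that precedes nothing but other Y values on a day without X, and it still removes Y when Y is listed before X.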
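You can use itertools.groupby() to group consecutive tuples recorded on the same day, then filter out the Y entries in each group that contains more than one tuple. Because groupby() only merges adjacent items, this assumes the sequence is ordered by day, as in the example.
Finally, use itertools.chain() to flatten the list of lists.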
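import itertools

def remove_y(lst):
    res = []
    # groupby only merges consecutive items, so same-day tuples must be adjacent
    for key, values in itertools.groupby(lst, key=lambda x: x[1]):
        values = list(values)
        if len(values) > 1:
            # several codes on this day: drop the Y entries
            res.append([value for value in values if value[0] != 'Y'])
        else:
            res.append(values)
    return list(itertools.chain.from_iterable(res))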
df['Sequence'] = df['Sequence'].apply(remove_y)
# print(df)
ID Sequence
0 1 [(A, 1), (B, 1), (X, 2), (Y, 3)]
1 2 [(C, 1), (X, 2), (Z, 2)]
2 3 [(C, 1), (D, 2), (X, 3), (Z, 3)]
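Note that the groupby approach above drops Y from any day that has more than one entry, whether or not X is among them. A possible variant that checks for X in each day group first is sketched below; the name remove_y_if_x is hypothetical, and the sketch again assumes lists of (code, day) tuples with string codes, ordered by day.

```python
from itertools import chain, groupby
from operator import itemgetter


def remove_y_if_x(seq):
    """Drop 'Y' entries only from day groups that also contain an 'X' entry."""
    groups = []
    # groupby merges consecutive items only, so seq is assumed ordered by day.
    for _, group in groupby(seq, key=itemgetter(1)):
        group = list(group)
        if any(item[0] == 'X' for item in group):
            # X was recorded on this day: filter out the Y entries.
            group = [item for item in group if item[0] != 'Y']
        groups.append(group)
    # Flatten the per-day groups back into a single list.
    return list(chain.from_iterable(groups))


print(remove_y_if_x([('Y', 1), ('Z', 1), ('X', 2), ('Y', 2)]))
# [('Y', 1), ('Z', 1), ('X', 2)]
```

Here the Y on day 1 is kept because no X was recorded that day, while the Y on day 2 is removed.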