Given a pd.DataFrame like:
   to_remove        pred_0           ....  pred_10
0  ['apple']        ['apple','abc']  ....  ['apple','orange']
1  ['cd','sister']  ['uncle','cd']   ....  ['apple']
On each row, I want to remove any element from pred_0 ... pred_10 that also appears in to_remove on the same row.
In this example, the answer should be:
   to_remove        pred_0     ....  pred_10
0  ['apple']        ['abc']    ....  ['orange']  # removed 'apple' from this row
1  ['cd','sister']  ['uncle']  ....  ['apple']   # removed 'cd' and 'sister' from this row
I am wondering how to write the code to do so.
To generate the example df:
import pandas as pd
from collections import OrderedDict

D = pd.DataFrame(OrderedDict({
    'to_remove': [['apple'], ['cd', 'sister']],
    'pred_0': [['apple', 'abc'], ['uncle', 'cd']],
    'pred_1': [['apple', 'orange'], ['apple']],
}))
You can iterate row by row and keep only the elements that are not listed in to_remove for that row.
Considered dataframe:
         pred_0          pred_10     to_remove
0  [apple, abc]  [apple, orange]       [apple]
1   [uncle, cd]          [apple]  [cd, sister]
df.apply(lambda x: x[x.index.difference(['to_remove'])]
                    .apply(lambda y: [i for i in y if i not in x['to_remove']]),
         axis=1)
Out:
    pred_0   pred_10
0    [abc]  [orange]
1  [uncle]   [apple]
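Note that the apply call above returns only the filtered pred columns. If you also want to keep to_remove alongside the results, one option is to assign the filtered frame back to those columns. A minimal sketch of that idea, rebuilding the frame from the question's generator:

```python
import pandas as pd
from collections import OrderedDict

df = pd.DataFrame(OrderedDict({
    'to_remove': [['apple'], ['cd', 'sister']],
    'pred_0': [['apple', 'abc'], ['uncle', 'cd']],
    'pred_1': [['apple', 'orange'], ['apple']],
}))

# Columns to filter: everything except 'to_remove'
pred_cols = df.columns.difference(['to_remove'])

# Filter each pred column per row, then assign back in place,
# so 'to_remove' survives in the result
df[pred_cols] = df.apply(
    lambda x: x[pred_cols].apply(
        lambda y: [i for i in y if i not in x['to_remove']]),
    axis=1,
)
print(df)
```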
You can use a couple of list comprehensions:
s = df['to_remove'].map(set)

for col in ['pred_0', 'pred_1']:
    df[col] = [[i for i in L if i not in S] for L, S in zip(df[col], s)]

print(df)
to_remove pred_0 pred_1
0 [apple] [abc] [orange]
1 [cd, sister] [uncle] [apple]
List comprehensions will likely be more efficient than pd.DataFrame.apply, which has the expense of constructing and passing a Series to the function for each row. As you can see, there's no real leveraging of Pandas / NumPy for your requirement.
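The efficiency claim can be checked with timeit. A rough sketch, assuming a frame built by repeating the question's two rows (the function names and the repetition factor are illustrative; actual timings will vary by machine):

```python
import timeit
import pandas as pd
from collections import OrderedDict

# Repeat the example rows to get a frame large enough to time
df = pd.DataFrame(OrderedDict({
    'to_remove': [['apple'], ['cd', 'sister']] * 500,
    'pred_0': [['apple', 'abc'], ['uncle', 'cd']] * 500,
    'pred_1': [['apple', 'orange'], ['apple']] * 500,
}))

def with_apply():
    # Row-wise apply: builds and passes a Series per row
    return df.apply(
        lambda x: x[x.index.difference(['to_remove'])]
                   .apply(lambda y: [i for i in y if i not in x['to_remove']]),
        axis=1)

def with_listcomp():
    # Plain list comprehensions over the underlying lists
    out = df.copy()
    s = out['to_remove'].map(set)
    for col in ['pred_0', 'pred_1']:
        out[col] = [[i for i in L if i not in S] for L, S in zip(out[col], s)]
    return out

print('apply:   ', timeit.timeit(with_apply, number=3))
print('listcomp:', timeit.timeit(with_listcomp, number=3))
```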
As such, unless you can afford to expand your lists into series of strings, dict + list may be a more appropriate choice of data structure.
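To illustrate the dict + list suggestion, here is one possible shape for that structure: a plain list of per-row dicts, filtered with ordinary Python (the `rows` layout is an assumption for illustration, not something from the answer above):

```python
# Each row is a dict of lists; no DataFrame of object columns involved
rows = [
    {'to_remove': ['apple'],
     'pred_0': ['apple', 'abc'],
     'pred_1': ['apple', 'orange']},
    {'to_remove': ['cd', 'sister'],
     'pred_0': ['uncle', 'cd'],
     'pred_1': ['apple']},
]

for row in rows:
    # Convert to a set once per row for O(1) membership tests
    remove = set(row['to_remove'])
    for key in ('pred_0', 'pred_1'):
        row[key] = [i for i in row[key] if i not in remove]

print(rows)
# row 0: pred_0 -> ['abc'], pred_1 -> ['orange']
# row 1: pred_0 -> ['uncle'], pred_1 -> ['apple']
```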