I have a dataframe with movie titles and one column per genre. For example, the movie titled 'One' is both 'Action' and 'Vestern', because it has a 1 in the corresponding columns.
Movie Action Fantasy Vestern
0 One 1 0 1
1 Two 0 0 1
2 Three 1 1 0
My goal is to create a column genres, which will contain the name of each genre that a particular movie has. For this I tried to use a lambda and a list comprehension, because I thought they would help. But after running a line of code such as:
df['genres'] = df.apply(lambda x: [x+"|"+x for x in df.columns if x!=0])
I got only NaN values in each row:
Movie Action Fantasy Vestern genres
0 One 1 0 1 NaN
1 Two 0 0 1 NaN
2 Three 1 1 0 NaN
I also tried to use groupby, but didn't succeed.
Expected output is:
Movie Action Fantasy Vestern genres
0 One 1 0 1 Action|Vestern
1 Two 0 0 1 Vestern
2 Three 1 1 0 Action|Fantasy
Code to reproduce:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Movie":['One','Two','Three'],
"Action":[1,0,1],
"Fantasy":[0,0,1],
"Vestern":[1,1,0]})
print(df)
Thanks for your help
import pandas as pd
import numpy as np
df = pd.DataFrame({"Movie":['One','Two','Three'],
"Action":[1,0,1],
"Fantasy":[0,0,1],
"Vestern":[1,1,0]})
cols = df.columns.tolist()[1:]
df['genres'] = df.apply(lambda x: "|".join(str(z) for z in [i for i in cols if x[i] !=0]) ,axis=1)
print(df)
Movie Action Fantasy Vestern genres
0 One 1 0 1 Action|Vestern
1 Two 0 0 1 Vestern
2 Three 1 1 0 Action|Fantasy
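The key differences from the attempt in the question are axis=1, so the lambda receives one row at a time instead of one column, and the x[i] != 0 test, which looks at the cell values rather than at the column names. As a minimal sketch of the same row-wise idea, the lambda can be written as a named function (row_genres and genre_cols are names chosen here for illustration, they are not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({"Movie": ["One", "Two", "Three"],
                   "Action": [1, 0, 1],
                   "Fantasy": [0, 0, 1],
                   "Vestern": [1, 1, 0]})

# Every column except the title is treated as a genre flag (assumption).
genre_cols = [c for c in df.columns if c != "Movie"]

def row_genres(row):
    # With axis=1, `row` is one row of df as a Series indexed by column name.
    return "|".join(c for c in genre_cols if row[c] != 0)

df["genres"] = df.apply(row_genres, axis=1)
print(df)
```

This does the same work as the one-liner, so it is equally slow on large frames; it is only meant to make the row-wise logic easier to read.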
To improve performance, it is possible to use dot: multiply all columns except the first by the column names (also without the first) with a | separator appended, then remove the trailing | with rstrip:
df['new'] = df.iloc[:, 1:].dot(df.columns[1:] + '|').str.rstrip('|')
print (df)
Movie Action Fantasy Vestern new
0 One 1 0 1 Action|Vestern
1 Two 0 0 1 Vestern
2 Three 1 1 0 Action|Fantasy
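Why this works may not be obvious: a dot product is a sum of products, and in Python an integer times a string repeats the string while + concatenates strings. The sketch below reproduces that arithmetic for the first row in plain Python (the variable names are illustrative, and it only mirrors the idea; it is not how pandas implements dot internally):

```python
# Flags of the movie 'One' and the genre names with the separator appended.
row = [1, 0, 1]
labels = ["Action|", "Fantasy|", "Vestern|"]

# 1 * 'Action|' keeps the label, 0 * 'Fantasy|' gives an empty string.
parts = [flag * label for flag, label in zip(row, labels)]

# Summing the products is plain string concatenation here.
joined = "".join(parts)

# Drop the separator left over after the last genre.
result = joined.rstrip("|")
print(parts, joined, result)
```

Note that this relies on the flags being exactly 0 or 1: a value of 2 would repeat the genre name twice.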
Or, starting again from the original df, multiply the values by the column names and use a list comprehension to join all values except the empty strings:
arr = df.iloc[:, 1:].values * df.columns[1:].values
df['new'] = ['|'.join(y for y in x if y) for x in arr]
print (df)
Movie Action Fantasy Vestern new
0 One 1 0 1 Action|Vestern
1 Two 0 0 1 Vestern
2 Three 1 1 0 Action|Fantasy
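To see what the intermediate array holds, here is a small NumPy-only sketch of the same multiplication. It assumes the genre names are stored as an object array, which is what .values returned for the column Index in the answer; values and names are illustrative variable names:

```python
import numpy as np

# The 0/1 flags for the three movies, one row per movie.
values = np.array([[1, 0, 1],
                   [0, 0, 1],
                   [1, 1, 0]])

# Genre names as Python strings in an object array (assumption, see above).
names = np.array(["Action", "Fantasy", "Vestern"], dtype=object)

# Broadcasting multiplies each flag by the name of its column:
# 1 * 'Action' is 'Action', 0 * 'Action' is the empty string.
arr = values * names
print(arr)

# Empty strings are falsy, so `if y` keeps only the genres that are set.
genres = ["|".join(y for y in x if y) for x in arr]
print(genres)
```

Because the empty strings are filtered out before joining, no trailing separator has to be stripped afterwards.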
Performance:
In [54]: %timeit (jez1(df.copy()))
25.2 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [55]: %timeit (jez2(df.copy()))
61.4 ms ± 769 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [56]: %timeit (csm(df.copy()))
1.46 s ± 35.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df = pd.DataFrame({"Movie":['One','Two','Three'],
"Action":[1,0,1],
"Fantasy":[0,0,1],
"Vestern":[1,1,0]})
#print(df)
#30k rows
df = pd.concat([df] * 10000, ignore_index=True)
def csm(df):
    cols = df.columns.tolist()[1:]
    df['genres'] = df.apply(lambda x: "|".join(str(z) for z in [i for i in cols if x[i] !=0]) ,axis=1)
    return df
def jez1(df):
    df['new'] = df.iloc[:, 1:].dot(df.columns[1:] + '|').str.rstrip('|')
    return df
def jez2(df):
    arr = df.iloc[:, 1:].values * df.columns[1:].values
    df['new'] = ['|'.join(y for y in x if y) for x in arr]
    return df