I have a dataframe that looks like this:
userId movie1 movie2 movie3 movie4 score
0 4.1 2.1 1.0 NaN 2
1 3.1 1.1 3.4 1.4 1
2 2.8 NaN 1.7 NaN 3
3 NaN 5.0 NaN 2.3 4
4 NaN NaN NaN NaN 1
5 2.3 NaN 2.0 4.0 1
I want to subtract each user's score from every movie rating in that row, so the output would look like this:
userId movie1 movie2 movie3 movie4 score
0 2.1 0.1 -1.0 NaN 2
1 2.1 0.1 2.4 0.4 1
2 -0.2 NaN -2.3 NaN 3
3 NaN 1.0 NaN -1.7 4
4 NaN NaN NaN NaN 1
5 1.3 NaN 1.0 3.0 1
The actual dataframe has thousands of movies, and the movies are referenced by name, so I'm trying to find a solution that works with that.
I should also have mentioned that the movies are not named in order like ["movie1", "movie2", "movie3"]; they are listed by title instead, like ["Star Wars", "Harry Potter", "Lord of the Rings"]. The dataset could change, so I won't know what the last movie in the list is.
Use df.filter to identify the movie columns, then subtract the score array from those columns:
In [35]: x = df.filter(like='movie', axis=1).columns.tolist()
In [36]: df[x] = df.filter(like='movie', axis=1) - df.score.values[:, None]
In [37]: df
Out[37]:
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
EDIT: When the movie column names are arbitrary, select all columns except 'userId' and 'score':
x = df.columns[~df.columns.isin(['userId', 'score'])]
df[x] = df[x] - df.score.values[:, None]
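As a rough sketch of how the exclusion approach behaves with title-named columns, the snippet below reuses the first three rows of the question's data under the titles the question mentions. Which rating goes under which title is an assumption made purely for illustration, and `movie_cols` is a hypothetical variable name.

```python
import numpy as np
import pandas as pd

# Illustrative frame: first three rows of the question's data,
# with hypothetical title-style column names.
df = pd.DataFrame({
    "userId": [0, 1, 2],
    "Star Wars": [4.1, 3.1, 2.8],
    "Harry Potter": [2.1, 1.1, np.nan],
    "Lord of the Rings": [1.0, 3.4, 1.7],
    "score": [2, 1, 3],
})

# Every column that is not an identifier or the score is treated as a movie.
movie_cols = df.columns[~df.columns.isin(["userId", "score"])]

# Reshape score to a column vector so it is subtracted row by row.
df[movie_cols] = df[movie_cols] - df["score"].to_numpy()[:, None]
print(df)
```

Because the movie columns are found by excluding known names, nothing here depends on how many movies there are or on which one comes last.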
You can use NumPy broadcasting to subtract here.
v = df.loc[:, 'movie1':'movie4'].to_numpy()
s = df['score'].to_numpy()
out = v - s[:, None]
df.loc[:, 'movie1':'movie4'] = out
df
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
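To see why `s[:, None]` is needed, here is a small NumPy-only sketch with a made-up 3x2 slice of the ratings. The names `col` and `failed` are hypothetical and exist only for this illustration.

```python
import numpy as np

v = np.array([[4.1, 2.1],
              [3.1, 1.1],
              [2.8, np.nan]])   # shape (3, 2): rows are users, columns are movies
s = np.array([2, 1, 3])        # shape (3,): one score per user

col = s[:, None]               # shape (3, 1): a column vector
out = v - col                  # (3, 2) - (3, 1) broadcasts across the columns

# Without the extra axis, NumPy tries to match the 3 scores against the
# 2 movie columns, and the shapes (3, 2) and (3,) are incompatible.
try:
    v - s
    failed = False
except ValueError:
    failed = True

print(out.shape, failed)
```

NaN entries simply stay NaN after the subtraction, which matches the expected output in the question.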
If you don't know the column names, use pd.Index.difference here.
cols = df.columns.difference(['userId', 'score'])
# Every column name is extracted except for 'userId' and 'score'
cols
# Index(['movie1', 'movie2', 'movie3', 'movie4'], dtype='object')
Now, just replace 'movie1':'movie4' with cols.
v = df.loc[:, cols].to_numpy()
s = df['score'].to_numpy()
out = v - s[:, None]
df.loc[:, cols] = out
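One detail worth knowing when the columns are real titles: by default `Index.difference` tries to sort its result, so `cols` may not follow the original column order. The sketch below compares it with a boolean mask, using the hypothetical titles from the question.

```python
import pandas as pd

columns = pd.Index(
    ["userId", "Star Wars", "Harry Potter", "Lord of the Rings", "score"]
)

# difference() sorts its result by default.
by_difference = columns.difference(["userId", "score"])

# sort=False and a boolean mask both keep the original order.
unsorted = columns.difference(["userId", "score"], sort=False)
by_mask = columns[~columns.isin(["userId", "score"])]

print(list(by_difference))  # ['Harry Potter', 'Lord of the Rings', 'Star Wars']
print(list(by_mask))        # ['Star Wars', 'Harry Potter', 'Lord of the Rings']
```

The subtraction stays correct either way, since the values are read and written back through the same label selection; the ordering only matters if you rely on the column order of `cols` elsewhere.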
A possible solution
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['userId'] = [0 , 1 , 2 , 3 , 4 , 5 ]
df['movie1'] = [4.1 , 3.1, 2.8 , np.nan, np.nan, 2.3 ]
df['movie2'] = [2.1 , 1.1, np.nan, 5.0 , np.nan, np.nan]
df['movie3'] = [1.0 , 3.4, 1.7 , np.nan, np.nan, 2.0 ]
df['movie4'] = [np.nan, 1.4, np.nan, 2.3 , np.nan, 4.0 ]
df['score'] = [2, 1, 3, 4, 5, 6]
print('before = ', df)
# Assumes userId is the first column and score is the last one.
df.iloc[:, 1:-1] = df.iloc[:, 1:-1].sub(df.iloc[:, -1].values, axis=0)
print('after = ', df)
It should return
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
If the column names are unknown, you can select the columns by position with iloc (this assumes userId is the first column and score is the last) and use the sub method from pandas to avoid converting to NumPy or using apply. I'm assuming the value at [2, 'movie3'] is a typo in your expected output.
df.iloc[:,1:-1] = df.iloc[:,1:-1].sub(df.score, axis=0)
df
Out:
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 1
5 5 1.3 NaN 1.0 3.0 1
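The `axis=0` argument matters here. A minimal sketch on a made-up two-row frame (the names `ratings`, `row_wise` and `col_wise` are hypothetical) shows what happens with and without it:

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame({"movie1": [4.1, 3.1], "movie2": [2.1, np.nan]})
score = pd.Series([2, 1], name="score")

# axis=0 aligns the Series index with the row index: one score per row.
row_wise = ratings.sub(score, axis=0)

# The default (axis='columns') aligns the Series index with the column
# labels instead. The labels 0 and 1 match no movie column, so every
# value in the result is NaN.
col_wise = ratings.sub(score)

print(row_wise)
print(col_wise)
```

Passing a Series rather than a raw array also means the subtraction is aligned on the index, which is convenient if the rows have been filtered or reordered.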
df.loc[:, "movie1":"movie4"] = df.loc[:, "movie1":"movie4"].apply(
    lambda x: x - df["score"]
)
print(df)
Prints:
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
Solution without using .apply():
df.iloc[:, 1:5] = (
    df.iloc[:, 1:5]
    - df['score'].values.reshape(-1, 1)
)