Subtract from every value in a DataFrame

I have a dataframe that looks like this:

userId   movie1   movie2   movie3   movie4   score
0        4.1      2.1      1.0      NaN      2
1        3.1      1.1      3.4      1.4      1
2        2.8      NaN      1.7      NaN      3
3        NaN      5.0      NaN      2.3      4
4        NaN      NaN      NaN      NaN      1
5        2.3      NaN      2.0      4.0      1

I want to subtract each user's score from every movie rating in that row, so the output would look like this:

userId   movie1   movie2   movie3   movie4   score
0        2.1      0.1     -1.0      NaN      2
1        2.1      0.1      2.4      0.4      1
2       -0.2      NaN     -2.3      NaN      3
3        NaN      1.0      NaN     -1.7      4
4        NaN      NaN      NaN      NaN      1
5        1.3      NaN      1.0      3.0      1

The actual dataframe has thousands of movies, and the movies are referenced by name, so I'm trying to find a solution that works with that.

I should also have mentioned that the movie columns are not named in sequence like ["movie1", "movie2", "movie3"]; they are named by title instead, like ["Star Wars", "Harry Potter", "Lord of the Rings"]. The dataset may change, so I won't know which movie is last in the list.

Use df.filter to identify the movie columns, then subtract the score array from those columns:

In [35]: x = df.filter(like='movie', axis=1).columns.tolist()

In [36]: df[x] = df.filter(like='movie', axis=1) - df.score.values[:, None]

In [37]: df
Out[37]: 
   userId  movie1  movie2  movie3  movie4  score
0       0     2.1     0.1    -1.0     NaN      2
1       1     2.1     0.1     2.4     0.4      1
2       2    -0.2     NaN    -1.3     NaN      3
3       3     NaN     1.0     NaN    -1.7      4
4       4     NaN     NaN     NaN     NaN      5
5       5    -3.7     NaN    -4.0    -2.0      6

EDIT: When the movie column names are arbitrary, select all columns except 'userId' and 'score':

x = df.columns[~df.columns.isin(['userId', 'score'])]
df[x] = df[x] - df.score.values[:, None]
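The same idea can be sketched end to end with title-named columns, as described in the question. The sample values and the name movie_cols below are illustrative assumptions, not part of the original data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: movie columns named by title, in arbitrary order.
df = pd.DataFrame({
    "userId": [0, 1, 2],
    "Star Wars": [4.1, 3.1, np.nan],
    "Harry Potter": [2.1, np.nan, 5.0],
    "Lord of the Rings": [1.0, 3.4, 1.7],
    "score": [2, 1, 3],
})

# Treat every column that is not an identifier or the score as a movie.
movie_cols = df.columns[~df.columns.isin(["userId", "score"])]

# Reshape score to (n_rows, 1) so it broadcasts across every movie column.
df[movie_cols] = df[movie_cols] - df["score"].to_numpy()[:, None]
print(df)
```

Because the columns are chosen by exclusion, adding or reordering movies should not require any change to the code, as long as 'userId' and 'score' keep their names.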

You can use NumPy broadcasting to subtract here.

v = df.loc[:, 'movie1':'movie4'].to_numpy()
s = df['score'].to_numpy()
out = v - s[:, None]
df.loc[:, 'movie1':'movie4'] =  out

df
   userId  movie1  movie2  movie3  movie4  score
0       0     2.1     0.1    -1.0     NaN      2
1       1     2.1     0.1     2.4     0.4      1
2       2    -0.2     NaN    -1.3     NaN      3
3       3     NaN     1.0     NaN    -1.7      4
4       4     NaN     NaN     NaN     NaN      5
5       5    -3.7     NaN    -4.0    -2.0      6
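To see why the s[:, None] step matters, here is a small NumPy-only sketch with made-up values. The name s_col is introduced just for illustration:

```python
import numpy as np

v = np.array([[4.1, 2.1], [3.1, 1.1], [2.8, np.nan]])  # shape (3, 2): rows x movies
s = np.array([2, 1, 3])                                 # shape (3,): one score per row

s_col = s[:, None]   # shape (3, 1): a column vector
out = v - s_col      # (3, 2) - (3, 1) -> (3, 2); each row uses its own score
print(s_col.shape, out.shape)

# Without the extra axis, NumPy tries to match s against the columns,
# which fails here because 3 != 2.
try:
    v - s
except ValueError as exc:
    print("shape mismatch:", exc)
```

Note that if the number of movies happened to equal the number of rows, v - s would not raise; it would silently subtract along the wrong axis, which is another reason to add the axis explicitly.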

If you don't know the column names, use pd.Index.difference here.

cols = df.columns.difference(['userId', 'score'])
# Every column name is extracted except for 'userId' and 'score'
cols
# Index(['movie1', 'movie2', 'movie3', 'movie4'], dtype='object')

Now just replace 'movie1':'movie4' with cols.

v = df.loc[:, cols].to_numpy()
s = df['score'].to_numpy()
out = v - s[:, None]
df.loc[:, cols] =  out
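One detail worth knowing when columns are titles rather than movie1..movie4: by default, Index.difference tries to sort its result, so cols may come back in a different order than the dataframe's columns. That is harmless in the snippet above, since the same cols is used for both reading and writing, but the following sketch (with assumed column names) shows the difference:

```python
import pandas as pd

columns = pd.Index(
    ["userId", "Star Wars", "Harry Potter", "Lord of the Rings", "score"]
)

# Default behaviour: the result is sorted when the labels are comparable.
sorted_cols = columns.difference(["userId", "score"])

# sort=False keeps the labels in their original order.
original_order = columns.difference(["userId", "score"], sort=False)

print(list(sorted_cols))
print(list(original_order))
```

If the original column order matters for later steps, passing sort=False is one way to keep it.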

A possible solution

import numpy  as np
import pandas as pd

df = pd.DataFrame()
df['userId'] = [0     , 1  , 2     , 3     , 4     , 5     ]
df['movie1'] = [4.1   , 3.1, 2.8   , np.nan, np.nan, 2.3   ]
df['movie2'] = [2.1   , 1.1, np.nan, 5.0   , np.nan, np.nan]
df['movie3'] = [1.0   , 3.4, 1.7   , np.nan, np.nan, 2.0   ]
df['movie4'] = [np.nan, 1.4, np.nan, 2.3   , np.nan, 4.0   ]
df['score'] = [2, 1, 3, 4, 5, 6]

print('before = ', df)
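# Assumes userId is the first column and score is the last one.
# axis=0 subtracts the score from each row rather than each column.
df.iloc[:, 1:-1] = df.iloc[:, 1:-1].sub(df.iloc[:, -1].values, axis=0)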

print('after = ', df)

It should return

   userId  movie1  movie2  movie3  movie4  score
0       0     2.1     0.1    -1.0     NaN      2
1       1     2.1     0.1     2.4     0.4      1
2       2    -0.2     NaN    -1.3     NaN      3
3       3     NaN     1.0     NaN    -1.7      4
4       4     NaN     NaN     NaN     NaN      5
5       5    -3.7     NaN    -4.0    -2.0      6

You can select the columns with iloc if the column names are unknown, and use pandas' sub method to avoid converting to NumPy or using apply. I'm assuming the value at [2, 'movie3'] is a typo in your expected output.

df.iloc[:,1:-1] = df.iloc[:,1:-1].sub(df.score, axis=0)
df

Out:

   userId  movie1  movie2  movie3  movie4  score
0       0     2.1     0.1    -1.0     NaN      2
1       1     2.1     0.1     2.4     0.4      1
2       2    -0.2     NaN    -1.3     NaN      3
3       3     NaN     1.0     NaN    -1.7      4
4       4     NaN     NaN     NaN     NaN      1
5       5     1.3     NaN     1.0     3.0      1
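If the positions of 'userId' and 'score' are not guaranteed either, a variant of the same sub call can select the movie columns by name instead of by position. This is only a sketch: the sample data, the non-default index, and the name movie_cols are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "userId": [0, 1, 2],
        "Star Wars": [4.1, 3.1, 2.8],
        "Harry Potter": [2.1, 1.1, np.nan],
        "score": [2, 1, 3],
    },
    index=[10, 11, 12],  # non-default index to show label alignment
)

# Drop the non-movie labels; whatever remains is treated as a movie column.
movie_cols = df.columns.drop(["userId", "score"])

# axis=0 matches the score Series to the rows by index label.
df[movie_cols] = df[movie_cols].sub(df["score"], axis=0)
print(df)
```

Passing the Series itself (rather than .values) lets pandas align on the row labels, so the result stays correct even when the index is not 0..n-1.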

Alternatively, using .apply() over a label slice:

df.loc[:, "movie1":"movie4"] = df.loc[:, "movie1":"movie4"].apply(
    lambda x: x - df["score"]
)
print(df)

Prints:

   userId  movie1  movie2  movie3  movie4  score
0       0     2.1     0.1    -1.0     NaN      2
1       1     2.1     0.1     2.4     0.4      1
2       2    -0.2     NaN    -1.3     NaN      3
3       3     NaN     1.0     NaN    -1.7      4
4       4     NaN     NaN     NaN     NaN      5
5       5    -3.7     NaN    -4.0    -2.0      6

Solution without using .apply():

df.iloc[:, 1:5] = (
    df.iloc[:, 1:5] 
    - df['score'].values.reshape(-1, 1)
)


 