I am transitioning to pandas for handling my CSV datasets. I am trying to do in pandas what I was already doing very easily in numpy: subtract one group of columns from another group several times. This is effectively an element-wise matrix subtraction.
Just for reference, this used to be my numpy solution:
import numpy

def subtract_baseline(data, baseline_columns, features_columns):
    """Takes in a list of baseline columns and feature columns, and subtracts the baseline values from all features"""
    assert len(features_columns) % len(baseline_columns) == 0, "The number of feature columns is not divisible by baseline columns"
    # Integer division, so that range() receives an int under Python 3
    num_blocks = len(features_columns) // len(baseline_columns)
    block_size = len(baseline_columns)
    for i in range(num_blocks):
        # Grab each feature block and subtract the baseline
        init_col = block_size * i + features_columns[0]
        final_col = init_col + block_size
        data[:, init_col:final_col] = numpy.subtract(data[:, init_col:final_col], data[:, baseline_columns])
    return data
To illustrate better, we can create the following toy dataset:
data = [[10,11,12,13,1,10],[20,21,22,23,1,10],[30,31,32,33,1,10],[40,41,42,43,1,10],[50,51,52,53,1,10],[60,61,62,63,1,10]]
df = pd.DataFrame(data,columns=['L1P1','L1P2','L2P1','L2P2','BP1','BP2'],dtype=float)
   L1P1  L1P2  L2P1  L2P2  BP1   BP2
0  10.0  11.0  12.0  13.0  1.0  10.0
1  20.0  21.0  22.0  23.0  1.0  10.0
2  30.0  31.0  32.0  33.0  1.0  10.0
3  40.0  41.0  42.0  43.0  1.0  10.0
4  50.0  51.0  52.0  53.0  1.0  10.0
5  60.0  61.0  62.0  63.0  1.0  10.0
The correct output would be the result of taking the values in L1P1 & L1P2 and subtracting BP1 & BP2 (AKA the baseline), then doing it again for L2P1 & L2P2 and any other feature columns there might be (this is what the for loop does in my original function).
   L1P1  L1P2  L2P1  L2P2  BP1   BP2
0   9.0   1.0  11.0   3.0  1.0  10.0
1  19.0  11.0  21.0  13.0  1.0  10.0
2  29.0  21.0  31.0  23.0  1.0  10.0
3  39.0  31.0  41.0  33.0  1.0  10.0
4  49.0  41.0  51.0  43.0  1.0  10.0
5  59.0  51.0  61.0  53.0  1.0  10.0
Note that the labels of the dataframe should not change, and ideally I'd want a method that relies on column indexes, not labels, because the actual data block is 30 columns wide, not 2 as in this example. This is how my original numpy function worked: the parameters baseline_columns and features_columns were just lists of column indexes.
After this, the baseline columns would be deleted altogether from the dataframe, as they have served their purpose.
I tried doing this for just one block using iloc, but I get NaN values:
df.iloc[:,[0,1]] = df.iloc[:,[0,1]] - df.iloc[:,[4,5]]
   L1P1  L1P2  L2P1  L2P2  BP1   BP2
0   NaN   NaN  12.0  13.0  1.0  10.0
1   NaN   NaN  22.0  23.0  1.0  10.0
2   NaN   NaN  32.0  33.0  1.0  10.0
3   NaN   NaN  42.0  43.0  1.0  10.0
4   NaN   NaN  52.0  53.0  1.0  10.0
5   NaN   NaN  62.0  63.0  1.0  10.0
Is there a reason you want to do it in one line? I.e., would it be okay for your purposes to do it in two lines:
df.iloc[:,0] = df.iloc[:,0] - df.iloc[:,4]
df.iloc[:,1] = df.iloc[:,1] - df.iloc[:,5]
These two lines achieve what I think is your intent.
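The per-column version works because each df.iloc[:, i] is a Series, and two Series align only on the row index, which is identical here. A minimal sketch of extending it to more columns, assuming a shortened version of the toy df from the question; the pairs list is a hypothetical helper, not something from the original post:

```python
import pandas as pd

data = [[10, 11, 12, 13, 1, 10],
        [20, 21, 22, 23, 1, 10],
        [30, 31, 32, 33, 1, 10]]
df = pd.DataFrame(data, columns=['L1P1', 'L1P2', 'L2P1', 'L2P2', 'BP1', 'BP2'], dtype=float)

# (feature position, baseline position) pairs; hypothetical, chosen for the toy data
pairs = [(0, 4), (1, 5), (2, 4), (3, 5)]
for feat, base in pairs:
    # Series minus Series: aligned on the row index only, so no NaN appears
    df.iloc[:, feat] = df.iloc[:, feat] - df.iloc[:, base]

print(df)
```

With 30 columns per block the pair list would probably be generated rather than typed out, so a block-wise approach may be more convenient.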
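Add .values at the end. When subtracting two DataFrames, pandas aligns them on both column and index labels; since the column labels at positions 0,1 do not match those at positions 4,5, every result is NaN. With .values the underlying NumPy arrays are subtracted positionally, with no alignment: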
df.iloc[:,[0,1]]=df.iloc[:,[0,1]].values - df.iloc[:,[4,5]].values
df
Out[176]:
   L1P1  L1P2  L2P1  L2P2  BP1   BP2
0   9.0   1.0  12.0  13.0  1.0  10.0
1  19.0  11.0  22.0  23.0  1.0  10.0
2  29.0  21.0  32.0  33.0  1.0  10.0
3  39.0  31.0  42.0  43.0  1.0  10.0
4  49.0  41.0  52.0  53.0  1.0  10.0
5  59.0  51.0  62.0  63.0  1.0  10.0
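The same idea can be extended to all feature blocks and to dropping the baseline afterwards, as the question asks. What follows is only a sketch under a few assumptions: the function name and parameters mirror the numpy version from the question, drop_baseline is a hypothetical extra flag, and .to_numpy() is used as the equivalent of .values that the pandas documentation recommends.

```python
import pandas as pd


def subtract_baseline(df, baseline_columns, features_columns, drop_baseline=True):
    """Subtract the baseline block from each equally sized feature block.

    Both column arguments are lists of integer positions, not labels.
    """
    block_size = len(baseline_columns)
    if len(features_columns) % block_size != 0:
        raise ValueError("The number of feature columns is not divisible by baseline columns")

    out = df.copy()  # leave the caller's DataFrame untouched
    # Plain ndarray: it carries no labels, so no alignment takes place
    baseline = out.iloc[:, baseline_columns].to_numpy()
    for start in range(0, len(features_columns), block_size):
        block = features_columns[start:start + block_size]
        out.iloc[:, block] = out.iloc[:, block].to_numpy() - baseline
    if drop_baseline:
        # Translate positions to labels only for the final drop
        out = out.drop(columns=out.columns[baseline_columns])
    return out


data = [[10, 11, 12, 13, 1, 10], [20, 21, 22, 23, 1, 10]]
df = pd.DataFrame(data, columns=['L1P1', 'L1P2', 'L2P1', 'L2P2', 'BP1', 'BP2'], dtype=float)
result = subtract_baseline(df, baseline_columns=[4, 5], features_columns=[0, 1, 2, 3])
print(result)
```

Unlike the numpy original, this version slices features_columns directly instead of assuming the feature columns are contiguous, and it returns a copy rather than modifying the input in place.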