简体   繁体   中英

Subtract a batch of columns in pandas

I am transitioning to using pandas for handling my csv datasets. I am currently trying to do in pandas what I was already doing very easily in numpy: subtract a group of columns from another group several times. This is effectively a element-wise matrix subtraction.

Just for reference, this used to be my numpy solution for this

def subtract_baseline(data, baseline_columns, features_columns):
    """Takes in a list of baseline columns and feature columns, and subtracts the baseline values from all features"""
    assert len(features_columns)%len(baseline_columns)==0, "The number of feature columns is not divisible by baseline columns"
    num_blocks = len(features_columns)/len(baseline_columns)    
    block_size = len(baseline_columns)                         
    for i in range(num_blocks):
        #Grab each feature block and subract the baseline
        init_col = block_size*i+features_columns[0]
        final_col = init_col+block_size
        data[:, init_col:final_col] = numpy.subtract(data[:, init_col:final_col], data[:,baseline_columns])
    return data 

To ilustrate better, we can create the following toy dataset:

data = [[10,11,12,13,1,10],[20,21,22,23,1,10],[30,31,32,33,1,10],[40,41,42,43,1,10],[50,51,52,53,1,10],[60,61,62,63,1,10]]
df = pd.DataFrame(data,columns=['L1P1','L1P2','L2P1','L2P2','BP1','BP2'],dtype=float)

   L1P1  L1P2  L2P1  L2P2   BP1   BP2
0  10.0  11.0  12.0  13.0   1.0  10.0
1  20.0  21.0  22.0  23.0   1.0  10.0
2  30.0  31.0  32.0  33.0   1.0  10.0
3  40.0  41.0  42.0  43.0   1.0  10.0
4  50.0  51.0  52.0  53.0   1.0  10.0
5  60.0  61.0  62.0  63.0   1.0  10.0

The correct output would be the result of grabbing the values in L1P1 & L1P2 and subtracting G1P1 & G1P2 (AKA the baseline), then doing it again for L2P1, L2P2 and any other columns there might be (this is what my for loop does in the original function).

   L1P1  L1P2  L2P1  L2P2   BP1   BP2
0   9.0   1.0  11.0   3.0   1.0  10.0
1  19.0  11.0  21.0  13.0   1.0  10.0
2  29.0  21.0  31.0  23.0   1.0  10.0
3  39.0  31.0  41.0  33.0   1.0  10.0
4  49.0  41.0  51.0  43.0   1.0  10.0
5  59.0  51.0  61.0  53.0   1.0  10.0

Note that labels for the dataframe should not change, and ideally I'd want a method that relies on the columns indexes, not labels, because the actual data block is 30 columns, not 2 like in this example. This is how my original function in numpy worked, the parameters baseline_columns and features_columns were just lists of the columns indexes.

After this the baseline columns would be deleted all together from the dataframe, as their function has already been fulfilled.

I tried doing this for just 1 batch using iloc but I get Nan values

df.iloc[:,[0,1]] = df.iloc[:,[0,1]] - df.iloc[:,[4,5]]

   L1P1  L1P2  L2P1  L2P2  G1P1  G1P2
0   NaN   NaN  12.0  13.0   1.0  10.0
1   NaN   NaN  22.0  23.0   1.0  10.0
2   NaN   NaN  32.0  33.0   1.0  10.0
3   NaN   NaN  42.0  43.0   1.0  10.0
4   NaN   NaN  52.0  53.0   1.0  10.0
5   NaN   NaN  62.0  63.0   1.0  10.0

Is there a reason you want to do it in one line? Ie would it be okay for your purposes to do it with two lines:

df.iloc[:,0] = df.iloc[:,0] - df.iloc[:,4]
df.iloc[:,1] = df.iloc[:,1] - df.iloc[:,5]

These two lines achieve what I think is your intent.

Adding .values at the end , pandas dataframe will search the column and index match to do the subtract , since the column is not match for 0,1 and 4,5 it will return NaN

df.iloc[:,[0,1]]=df.iloc[:,[0,1]].values - df.iloc[:,[4,5]].values
df
Out[176]: 
   L1P1  L1P2  L2P1  L2P2  BP1   BP2
0   9.0   1.0  12.0  13.0  1.0  10.0
1  19.0  11.0  22.0  23.0  1.0  10.0
2  29.0  21.0  32.0  33.0  1.0  10.0
3  39.0  31.0  42.0  43.0  1.0  10.0
4  49.0  41.0  52.0  53.0  1.0  10.0
5  59.0  51.0  62.0  63.0  1.0  10.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM