Subtract a batch of columns in pandas

I am transitioning to using pandas for handling my csv datasets. I am currently trying to do in pandas what I was already doing very easily in numpy: subtract one group of columns from another group, several times over. This is effectively an element-wise matrix subtraction.

Just for reference, this used to be my numpy solution:

import numpy

def subtract_baseline(data, baseline_columns, features_columns):
    """Takes in a list of baseline columns and feature columns, and subtracts the baseline values from all features"""
    assert len(features_columns) % len(baseline_columns) == 0, "The number of feature columns is not divisible by the number of baseline columns"
    num_blocks = len(features_columns) // len(baseline_columns)  # integer division, so range() accepts it
    block_size = len(baseline_columns)
    for i in range(num_blocks):
        # Grab each feature block and subtract the baseline
        init_col = block_size * i + features_columns[0]
        final_col = init_col + block_size
        data[:, init_col:final_col] = numpy.subtract(data[:, init_col:final_col], data[:, baseline_columns])
    return data

To illustrate better, we can create the following toy dataset:

import pandas as pd

data = [[10,11,12,13,1,10],[20,21,22,23,1,10],[30,31,32,33,1,10],[40,41,42,43,1,10],[50,51,52,53,1,10],[60,61,62,63,1,10]]
df = pd.DataFrame(data, columns=['L1P1','L1P2','L2P1','L2P2','BP1','BP2'], dtype=float)

   L1P1  L1P2  L2P1  L2P2   BP1   BP2
0  10.0  11.0  12.0  13.0   1.0  10.0
1  20.0  21.0  22.0  23.0   1.0  10.0
2  30.0  31.0  32.0  33.0   1.0  10.0
3  40.0  41.0  42.0  43.0   1.0  10.0
4  50.0  51.0  52.0  53.0   1.0  10.0
5  60.0  61.0  62.0  63.0   1.0  10.0

The correct output would be the result of taking the values in L1P1 & L1P2 and subtracting BP1 & BP2 (AKA the baseline), then doing it again for L2P1, L2P2 and any other columns there might be (this is what the for loop in my original function does).

   L1P1  L1P2  L2P1  L2P2   BP1   BP2
0   9.0   1.0  11.0   3.0   1.0  10.0
1  19.0  11.0  21.0  13.0   1.0  10.0
2  29.0  21.0  31.0  23.0   1.0  10.0
3  39.0  31.0  41.0  33.0   1.0  10.0
4  49.0  41.0  51.0  43.0   1.0  10.0
5  59.0  51.0  61.0  53.0   1.0  10.0

Note that the labels of the dataframe should not change, and ideally I'd want a method that relies on column indexes, not labels, because the actual data block is 30 columns, not 2 like in this example. This is how my original numpy function worked: the parameters baseline_columns and features_columns were just lists of column indexes.

After this, the baseline columns would be deleted altogether from the dataframe, as they have served their purpose.

I tried doing this for just 1 batch using iloc, but I get NaN values:

df.iloc[:,[0,1]] = df.iloc[:,[0,1]] - df.iloc[:,[4,5]]

   L1P1  L1P2  L2P1  L2P2   BP1   BP2
0   NaN   NaN  12.0  13.0   1.0  10.0
1   NaN   NaN  22.0  23.0   1.0  10.0
2   NaN   NaN  32.0  33.0   1.0  10.0
3   NaN   NaN  42.0  43.0   1.0  10.0
4   NaN   NaN  52.0  53.0   1.0  10.0
5   NaN   NaN  62.0  63.0   1.0  10.0

Is there a reason you want to do it in one line? I.e., would it be okay for your purposes to do it with two lines:

df.iloc[:,0] = df.iloc[:,0] - df.iloc[:,4]
df.iloc[:,1] = df.iloc[:,1] - df.iloc[:,5]

These two lines achieve what I think is your intent.
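If the per-column approach is acceptable, it could be extended to every feature column with a short loop. The following is only a sketch under the question's assumptions: baseline_columns and features_columns are positional index lists as described in the question, and the feature columns are laid out in consecutive blocks that each match the baseline order. Subtracting one column from another works here because two Series align only on the row index, which both columns share.

```python
import pandas as pd

# Small version of the toy dataset from the question (two rows only).
data = [[10, 11, 12, 13, 1, 10], [20, 21, 22, 23, 1, 10]]
df = pd.DataFrame(data, columns=['L1P1', 'L1P2', 'L2P1', 'L2P2', 'BP1', 'BP2'], dtype=float)

baseline_columns = [4, 5]        # positions of BP1, BP2
features_columns = [0, 1, 2, 3]  # positions of L1P1, L1P2, L2P1, L2P2
block_size = len(baseline_columns)

for k, col in enumerate(features_columns):
    # The k-th feature column pairs with baseline column k modulo the block size.
    base = baseline_columns[k % block_size]
    df.iloc[:, col] = df.iloc[:, col] - df.iloc[:, base]

print(df)
```

With 30 baseline columns this performs one column assignment per feature column, so it is simple but not the fastest option.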

Add .values at the end. When you subtract two DataFrames, pandas aligns them on both column labels and index before subtracting. Since the labels of columns 0,1 do not match the labels of columns 4,5, every result is NaN. With .values the subtraction is done on the underlying numpy arrays, by position:

df.iloc[:,[0,1]]=df.iloc[:,[0,1]].values - df.iloc[:,[4,5]].values
df
Out[176]: 
   L1P1  L1P2  L2P1  L2P2  BP1   BP2
0   9.0   1.0  12.0  13.0  1.0  10.0
1  19.0  11.0  22.0  23.0  1.0  10.0
2  29.0  21.0  32.0  33.0  1.0  10.0
3  39.0  31.0  42.0  43.0  1.0  10.0
4  49.0  41.0  52.0  53.0  1.0  10.0
5  59.0  51.0  62.0  63.0  1.0  10.0
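To cover all batches at once and then drop the baseline, the same idea could be wrapped in a function that mirrors the numpy version from the question. This is a sketch, not a definitive implementation: the function name and parameters are taken from the question, it assumes the feature columns are ordered in blocks that match the baseline order, and it assumes column labels are unique (the final drop is label-based). It uses to_numpy() to get positional arrays and numpy.tile to repeat the baseline once per block.

```python
import numpy as np
import pandas as pd

def subtract_baseline(df, baseline_columns, features_columns):
    """Subtract the baseline block from every feature block, by position,
    and return a new DataFrame without the baseline columns."""
    block_size = len(baseline_columns)
    if len(features_columns) % block_size != 0:
        raise ValueError("The number of feature columns is not divisible by the number of baseline columns")
    num_blocks = len(features_columns) // block_size

    # Plain arrays, so pandas does not try to align on column labels.
    baseline = df.iloc[:, baseline_columns].to_numpy()
    features = df.iloc[:, features_columns].to_numpy()

    result = df.copy()  # leave the caller's DataFrame untouched
    # Repeat the baseline side by side, once per feature block.
    result.iloc[:, features_columns] = features - np.tile(baseline, (1, num_blocks))
    # Remove the baseline columns now that they have been used.
    return result.drop(columns=df.columns[baseline_columns])

data = [[10, 11, 12, 13, 1, 10], [20, 21, 22, 23, 1, 10]]
df = pd.DataFrame(data, columns=['L1P1', 'L1P2', 'L2P1', 'L2P2', 'BP1', 'BP2'], dtype=float)
out = subtract_baseline(df, baseline_columns=[4, 5], features_columns=[0, 1, 2, 3])
print(out)
```

The column labels of the feature columns are preserved, and only positions are used to select the blocks.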
