Sum values of columns starting with the same string in pandas dataframe

I have a dataframe with about 100 columns that looks like this:

   Id  Economics-1  English-107  English-2  History-3  Economics-zz  Economics-2  \
0  56          1            1          0        1       0           0   
1  11          0            0          0        0       1           0   
2   6          0            0          1        0       0           1   
3  43          0            0          0        1       0           1   
4  14          0            1          0        0       1           0   

   Histo      Economics-51      Literature-re         Literatureu4  
0           1            0           1                0  
1           0            0           0                1  
2           0            0           0                0  
3           0            1           1                0  
4           1            0           0                0  

My goal is to keep only the global categories (such as English, History, Literature) and to store, for each of them, the sum of the values of its component columns in this dataframe. For instance, "English" would be the sum of "English-107" and "English-2":

    Id  Economics      English    History  Literature  
0  56          1            1          2        1                     
1  11          1            0          0        1                    
2   6          0            1          1        0                     
3  43          2            0          1        1                     
4  14          0            1          1        0          

For this purpose, I have tried two methods. First method:

df = pd.read_csv(file_path, sep='\t')
df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]

Second method:

df = pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History'] = 0 # initialize value, otherwise throws KeyError
for c in df[filter_col]:
    df['History'] = df[filter_col].sum(axes=1)
    print df['History', df[filter_col]]

However, both give the error:

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed

My question is: how can I debug this error, or is there another solution to my problem? Note that I have a rather large dataframe with about 100 columns and 400000 rows, so I'm looking for an optimized solution, like using loc in pandas.

I'd suggest that you do something different: transpose, group by the prefix of the rows (your original columns), sum, and transpose again.

Consider the following:

import pandas as pd

df = pd.DataFrame({
    'a_a': [1, 2, 3, 4],
    'a_b': [2, 3, 4, 5],
    'b_a': [1, 2, 3, 4],
    'b_b': [2, 3, 4, 5],
})

Now

[s.split('_')[0] for s in df.T.index.values]

is the list of column prefixes. So

>>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().T
    a   b
0   3   3
1   5   5
2   7   7
3   9   9

does what you want.

In your case, make sure to split using the '-' character.
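As a rough sketch of that last point, here is the same transpose/groupby/transpose pattern with a '-' split, applied to a small made-up frame that borrows a few column names from the question (the values and the choice of `Id` as index are assumptions for illustration). Note that a column without a '-' such as `Histo` ends up in its own group, separate from `History`.

```python
import pandas as pd

# Small illustrative frame; values are made up, column names follow the question.
df = pd.DataFrame({
    'Id': [56, 11, 6],
    'Economics-1': [1, 0, 0],
    'English-107': [1, 0, 0],
    'English-2': [0, 0, 1],
    'History-3': [1, 0, 0],
    'Histo': [1, 0, 0],
}).set_index('Id')  # keep Id out of the sums

# Text before the first '-' is the group label; 'Histo' has no '-' and stays as is.
prefixes = [name.split('-')[0] for name in df.columns]

# Transpose, group the (former) columns by prefix, sum, transpose back.
summed = df.T.groupby(prefixes).sum().T
print(summed)
```

If names like `Histo` should be merged into `History`, a plain split is not enough and some explicit mapping from prefix to category is needed.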

Using DSM's brilliant idea:

from __future__ import print_function

import pandas as pd

categories = set(['Economics', 'English', 'Histo', 'Literature'])

def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.startswith(cat)]    

df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')

#print(df)
print(df.groupby(correct_categories(df.columns),axis=1).sum())

Output:

    Economics  English  Histo  Literature
Id
56          1        1      2           1
11          1        0      0           1
6           1        1      0           0
43          2        0      1           1
14          1        1      1           0

Here is another version, which takes care of the "Histo/History" problem:

from __future__ import print_function

import pandas as pd

#categories = set(['Economics', 'English', 'Histo', 'Literature'])

#
# mapping: common starting pattern: desired name
#
categories = {
    'Histo': 'History',
    'Economics': 'Economics',
    'English': 'English',
    'Literature': 'Literature'
}

def correct_categories(cols):
    return [categories[cat] for col in cols for cat in categories.keys() if col.startswith(cat)]

df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df.columns, len(df.columns))
#print(correct_categories(df.columns), len(correct_categories(df.columns)))
#print(df.groupby(pd.Index(correct_categories(df.columns)),axis=1).sum())

rslt = df.groupby(correct_categories(df.columns),axis=1).sum()
print(rslt)
print('History\n', rslt['History'])

Output:

    Economics  English  History  Literature
Id
56          1        1        2           1
11          1        0        0           1
6           1        1        0           0
43          2        0        1           1
14          1        1        1           0
History
 Id
56    2
11    0
6     0
43    1
14    1
Name: History, dtype: int64

P.S. You may want to add any missing categories to the categories map/dictionary.
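One way to make a missing category visible rather than silently dropped is sketched below. The helper `category_for` and its fallback (an unmatched column keeps its own name as the group label) are hypothetical choices, not part of the answer above, and the data is made up. The sketch also groups on the transposed frame instead of passing `axis=1`, since, as far as I know, recent pandas releases deprecate the `axis` argument of `groupby`.

```python
import pandas as pd

# Prefix -> desired category name (same idea as the mapping above).
categories = {
    'Histo': 'History',
    'Economics': 'Economics',
    'English': 'English',
    'Literature': 'Literature',
}

def category_for(col):
    """Return the category for a column name, or the name itself if none matches."""
    for prefix, name in categories.items():
        if col.startswith(prefix):
            return name
    return col  # hypothetical fallback: an unmatched column forms its own group

# Small made-up frame; 'Maths-1' has no entry in `categories`.
df = pd.DataFrame({
    'Id': [56, 11],
    'History-3': [1, 0],
    'Histo': [1, 0],
    'Literatureu4': [0, 1],
    'Maths-1': [1, 1],
}).set_index('Id')

# One label per column, so the label list always matches the number of columns.
labels = [category_for(c) for c in df.columns]

# Group the transposed frame by label, sum, and transpose back.
rslt = df.T.groupby(labels).sum().T
print(rslt)
```

With this fallback, `Maths-1` shows up as its own column in the result, which is a hint that the mapping needs another entry.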

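You can use this to create the sum of the columns starting with a specific name (as written, the regex matches 'Economics' anywhere in the column name; use '^Economics' to match only at the start):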

df['Economics']= df[list(df.filter(regex='Economics'))].sum(axis=1)
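A small sketch of the difference between an unanchored and an anchored pattern with `DataFrame.filter`. The frame and the `MacroEconomics` column are invented for illustration; `filter` can also be summed directly, without going through `list(...)`.

```python
import pandas as pd

# Made-up data; 'MacroEconomics' only contains the word, it does not start with it.
df = pd.DataFrame({
    'Economics-1': [1, 0, 0],
    'Economics-2': [0, 0, 1],
    'MacroEconomics': [5, 5, 5],
    'English-2': [0, 0, 1],
})

# Unanchored: matches 'Economics' anywhere in the column name.
contains = df.filter(regex='Economics').sum(axis=1)

# Anchored with '^': matches only names that start with 'Economics'.
starts = df.filter(regex='^Economics').sum(axis=1)

print(contains.tolist())
print(starts.tolist())
```

Compute the sum before assigning it to a new `Economics` column; otherwise, on a second run, the new column would itself match the pattern and be included in the sum.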
