Elegantly summing named DataFrame columns in Python

I am attempting to write a function that will sum a set of specified columns in a pandas DataFrame.

First, some background. The columns share a common name (e.g., "var") followed by a sequential number (e.g., "var1", "var2"). I know I can sum, say, 5 columns together with the following code:

import pandas as pd
data = pd.read_csv('data_file.csv')
data['var_total'] = data.var1 + data.var2 + data.var3 + data.var4 + data.var5

However, this can be repetitive when you have var1-var30 to sum. I figured there must be some elegant solution to summing them more quickly, since the column names are predictable and uniform. Is there a function I can write or a built-in pandas function that will let me sum these more quickly?

I think you're looking for the filter method of DataFrame; you can pass it a substring (like=) or a regular expression (regex=), and it returns only the columns whose names match. Then you can call sum, or whatever else you want, on the resulting columns:

pd.DataFrame({'var1':[1], 'var2':[2],'othercol':['abc']})
  othercol  var1  var2
0      abc     1     2

pd.DataFrame({'var1':[1], 'var2':[2],'othercol':['abc']}).filter(like='var')
   var1  var2
0     1     2

pd.DataFrame({'var1':[1], 'var2':[2],'othercol':['abc']}).filter(like='var').sum(axis=1)
0    3

Note that I've called sum(axis=1) to return the row-wise sums; by default, sum returns the sum of each column.
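As a minimal sketch of the axis distinction, the following uses a small made-up frame (the values and the second row are assumptions for illustration, not the asker's data) and compares the two calls side by side:

```python
import pandas as pd

# Illustrative data only; column names mirror the example above.
df = pd.DataFrame({'var1': [1, 10], 'var2': [2, 20], 'othercol': ['abc', 'def']})

# Keep only the columns whose names contain 'var'.
var_cols = df.filter(like='var')

row_sums = var_cols.sum(axis=1)  # one total per row
col_sums = var_cols.sum()        # default axis=0: one total per column

print(row_sums.tolist())  # [3, 30]
print(col_sums.tolist())  # [11, 22]
```

The row-wise result has one entry per row of the frame, which is why it can be assigned straight back as a new column.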

You could do something like this:

data['var_total'] = data.filter(regex='^var[0-9]+$').sum(axis=1)

This first filters your DataFrame to retain only the columns whose names consist of var followed by one or more digits, then sums across each row of the filtered DataFrame. The ^ and $ anchors are needed because an unanchored 'var[0-9]+' can match anywhere in a column name.
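filter(regex=...) matches with re.search, so a pattern can match anywhere in a column name unless it is anchored with ^ and $. A small sketch of the difference, with column names invented purely for illustration:

```python
import pandas as pd

# Hypothetical columns chosen to show near-misses of the pattern.
df = pd.DataFrame({'var1': [1], 'var2': [2], 'var_total': [0],
                   'myvar3': [5], 'var4b': [7]})

loose = df.filter(regex='var[0-9]+')     # matches anywhere in the name
strict = df.filter(regex='^var[0-9]+$')  # whole name must be var + digits

print(list(loose.columns))   # ['var1', 'var2', 'myvar3', 'var4b']
print(list(strict.columns))  # ['var1', 'var2']
```

If every column containing "var" plus a digit really should be included, the loose pattern is fine; the anchored form is the safer default when other columns have similar names.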

Even if you are writing out all the column names, there are a couple of ways to do the sum a bit more elegantly:

import pandas as pd
import numpy as np

df = pd.DataFrame({'var1': np.random.randint(1, 10, 10),
                   'var2': np.random.randint(1, 10, 10),
                   'var3': np.random.randint(1, 10, 10)})

# Use the sum method:
df[['var1', 'var2', 'var3']].sum(axis='columns')

# Use eval
df.eval('var1 + var2 + var3')
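One thing worth knowing when choosing between the two: they can disagree when values are missing. sum skips NaN by default (skipna=True), while + inside eval propagates it. A short sketch with made-up data containing one NaN:

```python
import numpy as np
import pandas as pd

# Illustrative frame with a missing value in var2.
df = pd.DataFrame({'var1': [1.0, 2.0],
                   'var2': [np.nan, 3.0],
                   'var3': [4.0, 5.0]})

via_sum = df[['var1', 'var2', 'var3']].sum(axis='columns')  # NaN is skipped
via_eval = df.eval('var1 + var2 + var3')                    # NaN propagates

print(via_sum.tolist())   # [5.0, 10.0]
print(via_eval.tolist())  # [nan, 10.0]
```

Passing skipna=False to sum makes it behave like the addition.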

Then you can always use the standard Python tools for manipulating strings to put together the list of column names:

cols = ['var' + str(n) for n in range(1, 3 + 1)]
cols
Out[9]: ['var1', 'var2', 'var3']

df[cols].sum(axis='columns')
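Since the question asked for a function, the pieces above could be wrapped up roughly as follows. The name sum_numbered_columns and its parameters are hypothetical, chosen for this sketch; it is not a pandas built-in.

```python
import pandas as pd

def sum_numbered_columns(df, prefix, first, last):
    """Row-wise sum of columns prefix{first}..prefix{last}, inclusive.

    Hypothetical helper for illustration; not part of pandas.
    """
    cols = [f'{prefix}{n}' for n in range(first, last + 1)]
    # Fail early with a readable message if any expected column is absent.
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise KeyError(f'columns not found: {missing}')
    return df[cols].sum(axis='columns')

# Made-up data to exercise the helper.
df = pd.DataFrame({'var1': [1, 2], 'var2': [3, 4], 'var3': [5, 6],
                   'othercol': ['a', 'b']})
df['var_total'] = sum_numbered_columns(df, 'var', 1, 3)

print(df['var_total'].tolist())  # [9, 12]
```

Building the names explicitly, rather than filtering by pattern, means a missing column raises an error instead of being silently left out of the total.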

