
Summing specific columns in a pandas DataFrame

I'm trying to sum specific columns in a pandas DataFrame. The dataframe starts out containing text; for specific words I replace the text with a number and then carry out the sum.

I start by creating a sample DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [1,'produces','produces','understands','produces'], 'b' : [2,'','produces','understands','understands'], 'c' : [3,'','','understands','']})
transposed_df = df.transpose()
transposed_df

Output:

   0         1         2            3            4
a  1  produces  produces  understands     produces
b  2            produces  understands  understands
c  3                      understands             

This is all good and what I expect. I then change the relevant text to integers and create a dataframe of (mostly) integers.

measure1 = transposed_df.iloc[:,[0,1,2]].replace('produces',1)
measure2 = transposed_df.iloc[:,[0,3]].replace('understands',1)
measure3 = transposed_df.iloc[:,[0,4]].replace('produces',1)

measures = [measure1, measure2, measure3]

from functools import reduce
counter = reduce (lambda left, right: pd.merge(left,right), measures)

counter

Output:

   0  1  2  3            4
0  1  1  1  1            1
1  2     1  1  understands
2  3        1             

This is what I expect.

I then try to sum columns 1 and 2 across each row and add the result back into transposed_df:

transposed_df['first']=counter.iloc[:,[1,2]].sum(axis=1)
transposed_df

Output:

   0         1         2            3            4  first
a  1  produces  produces  understands     produces    NaN
b  2            produces  understands  understands    NaN
c  3                      understands                 NaN

I am expecting the final column to be 2, 1, 0. What am I doing wrong?

There are two problems: the summation, and the insertion of a column that has a different index.

1) Summation

Your df has object dtype (integers mixed with strings, including empty strings). The dataframe counter holds mixed types too (ints and strings):

counter.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 5 columns):
0    3 non-null int64
1    3 non-null object
2    3 non-null object
3    3 non-null int64
4    3 non-null object
dtypes: int64(2), object(3)

Keep in mind that, as the pandas documentation on dtypes puts it:

Columns with mixed types are stored with the object dtype.

So although the first row of counter contains two integers, they belong to series (columns) of type object, and pandas doesn't like to sum them up. (You are presumably using a pandas version below 0.22.0; in later versions the result is 0.0 with the default min_count=0, see the documentation of sum.) You can check the type of each cell with:

# applymap is deprecated since pandas 2.1; use DataFrame.map(type) there instead
counter.iloc[:,[1,2]].applymap(type)

               1              2
0  <class 'int'>  <class 'int'>
1  <class 'str'>  <class 'int'>
2  <class 'str'>  <class 'str'>

So the solution is to explicitly cast the objects to numeric values where possible. A row sums to a number only where it is made up entirely of integers; an empty string is converted to NaN, which makes the sum of that row NaN as well:

counter.iloc[:,[1,2]].apply(lambda x: sum(pd.to_numeric(x)), axis=1)

Result:

0    2.0
1    NaN
2    NaN
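
If the goal is the 2, 1, 0 that the question expects, one option is to coerce the cells to numbers first and then let pandas do the row sum, because DataFrame.sum skips NaN by default. The following is only a minimal sketch: the frame `cols` is a hand-built stand-in for columns 1 and 2 of counter, and the variable names are made up for illustration.

```python
import pandas as pd

# Stand-in for counter.iloc[:, [1, 2]]: ints mixed with empty strings (assumed data).
cols = pd.DataFrame({1: [1, '', ''], 2: [1, 1, '']}, dtype=object)

# Convert each column to numbers; values that cannot be parsed become NaN.
numeric = cols.apply(pd.to_numeric, errors='coerce')

# DataFrame.sum skips NaN by default, so a row of only NaN sums to 0.
row_sums = numeric.sum(axis=1)

# With min_count=1 a row needs at least one valid value, otherwise it is NaN.
row_sums_strict = numeric.sum(axis=1, min_count=1)

print(row_sums.tolist())
print(row_sums_strict.tolist())
```

Here row_sums holds 2.0, 1.0, 0.0, while row_sums_strict keeps NaN for the last row, which is the min_count behaviour mentioned above.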


2) Column insertion

The two dataframes have different indexes:

counter.index
# Int64Index([0, 1, 2], dtype='int64')
transposed_df.index
# Index(['a', 'b', 'c'], dtype='object')
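
The effect of such a mismatch can be reproduced on a small scale. This is a sketch with invented names (`target`, `sums`), not the data from the question: assigning a Series aligns on index labels, while assigning plain values fills by position.

```python
import pandas as pd

# Hypothetical frame indexed by letters, like transposed_df.
target = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])

# Hypothetical result Series with the default integer index 0, 1, 2.
sums = pd.Series([2.0, 1.0, 0.0])

# Series assignment aligns on labels; none match, so every cell becomes NaN.
target['aligned'] = sums

# A plain array carries no index, so the values are filled in by position.
target['by_position'] = sums.to_numpy()

# Alternatively, rebuild the Series on the target's index so alignment succeeds.
target['relabelled'] = pd.Series(sums.to_numpy(), index=target.index)

print(target)
```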

Therefore you get all NaNs with your method. The easiest fix is to insert just the values of the series instead of the series itself (which pandas would align on the index):

transposed_df['first'] = counter.iloc[:,[1,2]].apply(lambda x: sum(pd.to_numeric(x)), axis=1).to_list()

Result:

   0         1         2            3            4  first
a  1  produces  produces  understands     produces    2.0
b  2            produces  understands  understands    NaN
c  3                      understands                 NaN
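
As an alternative that is not part of the approach above, the matches could be counted directly with a boolean comparison, which avoids both the object-dtype sum and the index mismatch. This is a sketch that assumes only the word 'produces' should be counted in columns 1 and 2:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 'produces', 'produces', 'understands', 'produces'],
                   'b': [2, '', 'produces', 'understands', 'understands'],
                   'c': [3, '', '', 'understands', '']})
transposed_df = df.transpose()

# Compare columns 1 and 2 with the keyword, then count the True values per row.
# The result keeps the index a, b, c, so the assignment aligns correctly.
transposed_df['first'] = (transposed_df[[1, 2]] == 'produces').sum(axis=1)

print(transposed_df['first'].tolist())
```

With the sample data this gives 2, 1, 0 for rows a, b and c.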
