I'm trying to sum specific columns in a pandas DataFrame. The DataFrame starts out containing text; for specific words I replace the text with a number and then carry out my sum.
I start by creating a sample DataFrame:
import pandas as pd
df = pd.DataFrame({'a': [1,'produces','produces','understands','produces'], 'b' : [2,'','produces','understands','understands'], 'c' : [3,'','','understands','']})
transposed_df = df.transpose()
transposed_df
Output:
   0         1         2            3            4
a  1  produces  produces  understands     produces
b  2            produces  understands  understands
c  3                      understands
This is all good and what I expect. I then change the relevant text to integers and create a dataframe of (mostly) integers.
measure1 = transposed_df.iloc[:,[0,1,2]].replace('produces',1)
measure2 = transposed_df.iloc[:,[0,3]].replace('understands',1)
measure3 = transposed_df.iloc[:,[0,4]].replace('produces',1)
measures = [measure1, measure2, measure3]
from functools import reduce
counter = reduce(lambda left, right: pd.merge(left, right), measures)
counter
Output:
   0  1  2  3            4
0  1  1  1  1            1
1  2     1  1  understands
2  3        1
This is what I expect.
I then try to sum columns 1 and 2 across each row and add the result back into transposed_df:
transposed_df['first']=counter.iloc[:,[1,2]].sum(axis=1)
transposed_df
Output:
   0         1         2            3            4  first
a  1  produces  produces  understands     produces    NaN
b  2            produces  understands  understands    NaN
c  3                      understands                 NaN
I am expecting the final column to be 2, 1, 0. What am I doing wrong?
There are two problems: the summation and the insertion of a column with different indexes.
Your df has dtype object (mostly strings, including empty strings, alongside a few integers). The DataFrame counter has mixed types too (ints and strings):
counter.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 5 columns):
0 3 non-null int64
1 3 non-null object
2 3 non-null object
3 3 non-null int64
4 3 non-null object
dtypes: int64(2), object(3)
Keep in mind what the pandas documentation on dtypes says: "Columns with mixed types are stored with the object dtype."
So although the first row of counter contains two integers, they belong to Series (columns) of type object, and pandas does not sum them. (You are evidently using a pandas version below 0.22.0; in later versions the result is 0.0 with the default min_count=0, see the documentation for sum.) You can see the per-cell types with:
counter.iloc[:,[1,2]].applymap(type)
1 2
0 <class 'int'> <class 'int'>
1 <class 'str'> <class 'int'>
2 <class 'str'> <class 'str'>
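The same effect can be reproduced on a tiny frame. This is only a sketch with made-up data (mixed is an illustrative name, not something from the question); Series.map(type) is used here to list the Python type of each cell:

```python
import pandas as pd

# Made-up data: one purely numeric column, one that mixes ints and strings.
mixed = pd.DataFrame({'ints': [1, 2, 3], 'mixed': [1, '', 'understands']})

# The purely numeric column gets an integer dtype; the mixed one falls back to object.
print(mixed.dtypes)

# Inside the object column each cell keeps its own Python type.
cell_types = mixed['mixed'].map(type)
print(cell_types.tolist())
```

So an object column can hold real integers and still not count as numeric as far as pandas is concerned.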
So the solution is to explicitly cast the objects to numeric where possible (i.e. where the whole row is made up of integers, rather than a mix of empty strings and integers):
counter.iloc[:,[1,2]].apply(lambda x: sum(pd.to_numeric(x)), axis=1)
Result:
0 2.0
1 NaN
2 NaN
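If the goal is the 2, 1, 0 asked for in the question, one option is to coerce the non-numeric cells to NaN and let sum skip them. A minimal sketch, assuming pandas 0.22 or later (where an all-NaN row sums to 0) and using a small stand-in frame for columns 1 and 2 of counter (counter_part is a hypothetical name):

```python
import pandas as pd

# Stand-in for columns 1 and 2 of `counter`: ints mixed with empty strings.
counter_part = pd.DataFrame({1: [1, '', ''], 2: [1, 1, '']})

# errors='coerce' turns every cell that cannot be parsed as a number into NaN.
numeric = counter_part.apply(pd.to_numeric, errors='coerce')

# DataFrame.sum skips NaN by default (skipna=True), so each row sums what is left.
row_sums = numeric.sum(axis=1).astype(int)

print(row_sums.tolist())
```

Whether an empty cell should count as 0 or as missing depends on what the data means, so treat the coercion as a choice rather than a given.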
The second problem is that the two DataFrames have different indexes:
counter.index
# Int64Index([0, 1, 2], dtype='int64')
transposed_df.index
# Index(['a', 'b', 'c'], dtype='object')
Therefore you get all NaNs with your method. The easiest fix is to insert just the values of the Series instead of the Series itself (which pandas would align on the index):
transposed_df['first'] = counter.iloc[:,[1,2]].apply(lambda x: sum(pd.to_numeric(x)), axis=1).to_list()
Result:
   0         1         2            3            4  first
a  1  produces  produces  understands     produces    2.0
b  2            produces  understands  understands    NaN
c  3                      understands                 NaN
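To see the alignment behaviour in isolation, here is a small sketch with made-up data (target and values are illustrative names, not taken from the question):

```python
import pandas as pd

# A frame labelled 'a', 'b', 'c' and a Series with the default 0, 1, 2 index.
target = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])
values = pd.Series([2, 1, 0])

# Assigning a Series aligns on index labels; none match here, so every row is NaN.
target['aligned'] = values

# Assigning a plain array is positional, so the values land row by row.
target['positional'] = values.to_numpy()

# Alternative: rebuild the Series on the target's index before assigning.
target['reindexed'] = pd.Series(values.to_numpy(), index=target.index)

print(target)
```

Positional assignment only makes sense when the row order of both objects is known to match, which is the case here because the merge kept the rows in their original order.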