
Summing specific columns in a pandas DataFrame

I'm trying to sum specific columns in a pandas DataFrame. I start with text in the DataFrame; for specific words I change the text to a number and then carry out the sum.

I start by creating a sample DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [1,'produces','produces','understands','produces'], 'b' : [2,'','produces','understands','understands'], 'c' : [3,'','','understands','']})
transposed_df = df.transpose()
transposed_df

Output:

   0         1         2            3            4
a  1  produces  produces  understands     produces
b  2            produces  understands  understands
c  3                      understands             

This is all good and what I expect. I then change the relevant text to integers and create a DataFrame of (mostly) integers.

measure1 = transposed_df.iloc[:,[0,1,2]].replace('produces',1)
measure2 = transposed_df.iloc[:,[0,3]].replace('understands',1)
measure3 = transposed_df.iloc[:,[0,4]].replace('produces',1)

measures = [measure1, measure2, measure3]

from functools import reduce
counter = reduce (lambda left, right: pd.merge(left,right), measures)

counter

Output:

   0  1  2  3            4
0  1  1  1  1            1
1  2     1  1  understands
2  3        1             

This is what I expect.

I then try to sum columns 1 and 2 across each row and add the result back into transposed_df:

transposed_df['first']=counter.iloc[:,[1,2]].sum(axis=1)
transposed_df

Output:

   0         1         2            3            4  first
a  1  produces  produces  understands     produces    NaN
b  2            produces  understands  understands    NaN
c  3                      understands                 NaN

I am expecting the final column to be 2, 1, 0. What am I doing wrong?

There are two problems: the summation, and the insertion of a column with a different index.

1) Summation

Your df has dtype object in every column (each column mixes an integer with strings, including empty strings). The DataFrame counter is of mixed types too (ints and strings):

counter.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 5 columns):
0    3 non-null int64
1    3 non-null object
2    3 non-null object
3    3 non-null int64
4    3 non-null object
dtypes: int64(2), object(3)

Keep in mind that:

Columns with mixed types are stored with the object dtype. See dtypes.

So although the first row of counter contains two integers, they belong to Series (columns) of dtype object, and pandas doesn't like to sum them up (you're evidently using a pandas version below 0.22.0; in later versions the result is 0.0 with the default min_count=0, see sum). You can see the mixed types with:

counter.iloc[:,[1,2]].applymap(type)

               1              2
0  <class 'int'>  <class 'int'>
1  <class 'str'>  <class 'int'>
2  <class 'str'>  <class 'str'>
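
As a side note, DataFrame.applymap is deprecated in recent pandas releases (2.1 and later) in favour of DataFrame.map. Below is a sketch of the same per-cell type check that should run on old and new versions alike, using Series.map on each column. The counter frame is rebuilt by hand from the output printed in the question, on the assumption that blank cells are empty strings and the numbers are plain ints.

```python
import pandas as pd

# Hand-built copy of `counter` as printed in the question (an assumption,
# not the result of the original merge).
counter = pd.DataFrame({
    0: [1, 2, 3],
    1: [1, "", ""],
    2: [1, 1, ""],
    3: [1, 1, 1],
    4: [1, "understands", ""],
})

# Map every cell of columns 1 and 2 to its Python type, column by column.
cell_types = counter.iloc[:, [1, 2]].apply(lambda col: col.map(type))
print(cell_types)
```

Only the first row holds integers in both columns, which is why it is the only row that sums cleanly.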

So the solution would be to explicitly cast objects to numeric where possible (i.e. where the whole row is made up of integers, rather than a mix of empty strings and integers):

counter.iloc[:,[1,2]].apply(lambda x: sum(pd.to_numeric(x)), axis=1)

Result:

0    2.0
1    NaN
2    NaN
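
The question expected 2, 1, 0 rather than 2.0, NaN, NaN. Here is a minimal sketch of one way to get there, assuming that empty strings and leftover words should simply count as zero: coerce each column with pd.to_numeric(errors='coerce') and let DataFrame.sum skip the resulting NaNs. As before, counter is rebuilt by hand from the printed output, so its exact dtypes may differ from what the merge produces.

```python
import pandas as pd

# Hand-built copy of `counter` as printed in the question (an assumption:
# blank cells are empty strings, numbers are plain ints).
counter = pd.DataFrame({
    0: [1, 2, 3],
    1: [1, "", ""],
    2: [1, 1, ""],
    3: [1, 1, 1],
    4: [1, "understands", ""],
})

# Convert columns 1 and 2 to numbers; anything unparseable becomes NaN.
numeric = counter.iloc[:, [1, 2]].apply(pd.to_numeric, errors="coerce")

# Row-wise sum; NaN is skipped by default, so an all-NaN row sums to 0.0.
row_totals = numeric.sum(axis=1)
print(row_totals.tolist())  # [2.0, 1.0, 0.0]
```

The totals are floats because NaN forces a float dtype; append .astype(int) if integers are needed.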


2) Column insertion

The indexes are different:

counter.index
# Int64Index([0, 1, 2], dtype='int64')
transposed_df.index
# Index(['a', 'b', 'c'], dtype='object')

Therefore you get all NaNs with your method. The easiest fix is to insert just the values of the Series instead of the Series itself (when a Series is assigned, pandas aligns it on the index):

transposed_df['first'] = counter.iloc[:,[1,2]].apply(lambda x: sum(pd.to_numeric(x)), axis=1).to_list()

Result:

   0         1         2            3            4  first
a  1  produces  produces  understands     produces    2.0
b  2            produces  understands  understands    NaN
c  3                      understands                 NaN
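
The index alignment behind the all-NaN column can be reproduced in isolation. The sketch below uses made-up names (frame and totals are illustrative, not from the question) to compare assigning a Series directly with assigning its raw values or a relabelled copy.

```python
import pandas as pd

# Target frame labelled a, b, c; the Series carries the default 0, 1, 2 index.
frame = pd.DataFrame({"x": [10, 20, 30]}, index=["a", "b", "c"])
totals = pd.Series([2.0, 1.0, 0.0])

# Assigning a Series aligns on labels; none of 0, 1, 2 matches a, b, c,
# so every cell of the new column is NaN.
frame["aligned"] = totals

# Assigning a plain array (or list) is positional: no alignment happens.
frame["positional"] = totals.to_numpy()

# Alternatively, rebuild the Series on the target index before assigning.
frame["relabelled"] = pd.Series(totals.to_numpy(), index=frame.index)

print(frame)
```

Positional assignment is only safe when both objects have the same row order, which holds here because counter was derived row by row from transposed_df.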
