Summing specific columns in a pandas DataFrame

Question

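I am trying to sum specific columns in a pandas DataFrame. I start with text in the DataFrame; given specific words, I change the text to numbers and then sum them.

I first create a sample DataFrame: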

import pandas as pd

df = pd.DataFrame({'a': [1,'produces','produces','understands','produces'], 'b' : [2,'','produces','understands','understands'], 'c' : [3,'','','understands','']})
transposed_df = df.transpose()
transposed_df

Output:

   0         1         2            3            4
a  1  produces  produces  understands     produces
b  2            produces  understands  understands
c  3                      understands

This all works and is what I expect. I then change the relevant text to integers and create a DataFrame of (mostly) integers.

measure1 = transposed_df.iloc[:,[0,1,2]].replace('produces',1)
measure2 = transposed_df.iloc[:,[0,3]].replace('understands',1)
measure3 = transposed_df.iloc[:,[0,4]].replace('produces',1)

measures = [measure1, measure2, measure3]

from functools import reduce
counter = reduce(lambda left, right: pd.merge(left, right), measures)

counter

Output:

   0  1  2  3            4
0  1  1  1  1            1
1  2     1  1  understands
2  3        1

This is what I expect.

I then try to sum columns 1 and 2 of each row and add the result back to transposed_df:

transposed_df['first']=counter.iloc[:,[1,2]].sum(axis=1)
transposed_df

Output:

   0         1         2            3            4  first
a  1  produces  produces  understands     produces    NaN
b  2            produces  understands  understands    NaN
c  3                      understands                 NaN

I expected the last column to be 2, 1, 0. What am I doing wrong?

Answer 1

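There are two problems: the summation, and inserting a column that has a different index.

1) Summation

Your df has object dtype (all strings, including the empty strings). The DataFrame counter is also of mixed type (ints and strings):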

counter.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 5 columns):
0    3 non-null int64
1    3 non-null object
2    3 non-null object
3    3 non-null int64
4    3 non-null object
dtypes: int64(2), object(3)

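Keep in mind:

Columns with mixed types are stored with the object dtype. See dtypes.

So although the first row of counter contains two integers, they belong to Series (columns) of object dtype, and pandas does not want to sum them. (Apparently you are using a pandas version below 0.22.0; in later versions the result is 0.0 with the default min_count=0, see the documentation for sum.) You can see this with: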

counter.iloc[:,[1,2]].applymap(type)

               1              2
0  <class 'int'>  <class 'int'>
1  <class 'str'>  <class 'int'>
2  <class 'str'>  <class 'str'>
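
As a minimal sketch of the same point on a smaller scale: a Series that mixes integers and strings is stored with object dtype, while an all-integer Series gets an integer dtype. The names mixed, clean and types are made up for this illustration and do not appear in the question.

```python
import pandas as pd

# Hypothetical data: one integer next to two strings, as in the counter columns.
mixed = pd.Series([1, '', 'produces'])
clean = pd.Series([1, 2, 3])

print(mixed.dtype)  # object
print(pd.api.types.is_integer_dtype(clean))  # True

# Each element keeps its own Python type inside the object column.
types = mixed.map(type)
print(types.tolist())
```

Series.map(type) is used here instead of DataFrame.applymap because it works on a single column and is available across pandas versions.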

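So the solution is to explicitly convert the objects to numeric values where possible (that is, where the whole row consists of integers rather than a mix of empty strings and integers):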

counter.iloc[:,[1,2]].apply(lambda x: sum(pd.to_numeric(x)), axis=1)

Result:

0    2.0
1    NaN
2    NaN
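
The conversion above leaves NaN for any row that contains an empty string, whereas the question expected 2, 1, 0. One possible variant, offered as a sketch rather than as part of the original answer, is to coerce everything non-numeric to NaN and let DataFrame.sum skip the NaN values. The small counter frame below is rebuilt by hand to mirror columns 0, 1 and 2 of the question's data, and row_sums is a name chosen for this example.

```python
import pandas as pd

# Hand-built stand-in for columns 0, 1 and 2 of the question's counter frame.
counter = pd.DataFrame({0: [1, 2, 3], 1: [1, '', ''], 2: [1, 1, '']})

# errors='coerce' turns anything that is not a number (here: '') into NaN.
numeric = counter[[1, 2]].apply(pd.to_numeric, errors='coerce')

# sum() skips NaN by default; an all-NaN row sums to 0.0 (min_count=0, pandas >= 0.22).
row_sums = numeric.sum(axis=1)
print(row_sums.tolist())  # [2.0, 1.0, 0.0]
```

Whether an empty cell should count as 0 or stay NaN depends on what the data means, so this is a design choice rather than a correction.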

2) Column insertion

The two frames have different indexes:

counter.index
# Int64Index([0, 1, 2], dtype='int64')
transposed_df.index
# Index(['a', 'b', 'c'], dtype='object')

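So with your approach you get all NaNs. The simplest fix is to insert only the values of the Series rather than the Series itself (pandas aligns on the index):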

transposed_df['first'] = counter.iloc[:,[1,2]].apply(lambda x: sum(pd.to_numeric(x)), axis=1).to_list()

Result:

   0         1         2            3            4  first
a  1  produces  produces  understands     produces    2.0
b  2            produces  understands  understands    NaN
c  3                      understands                 NaN
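
The alignment behaviour can be reproduced in isolation. The sketch below uses a made-up frame (target) and a made-up Series (sums) to show that assigning a Series whose index labels do not match produces NaN, while assigning its raw values, or a Series rebuilt on the target's index, places the numbers by position.

```python
import pandas as pd

# Hypothetical frame with string labels, like transposed_df.
target = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])

# Hypothetical sums with the default integer index 0, 1, 2, like counter.
sums = pd.Series([2.0, 1.0, 0.0])

# Label-based alignment: no label in sums matches 'a', 'b' or 'c', so all NaN.
target['aligned'] = sums

# Positional assignment: the raw values carry no index to align on.
target['by_position'] = sums.to_numpy()

# Alternative: give the values the target's index before assigning.
target['reindexed'] = pd.Series(sums.to_numpy(), index=target.index)

print(target)
```

Positional assignment is only safe when both objects are known to be in the same row order, which holds here because counter was built from transposed_df without reordering rows.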
