Data error when using a function and groupby to union strings in a pandas DataFrame

Question

I have a dataframe of the following structure:

mydf:

    Entry   Address         ShortOrdDesc
0   988     Fake Address 1  SC_M_W_3_1
1   989     Fake Address 2  SC_M_W_3_3
2   992     Fake Address 3  nan_2
3   992                     SC_M_G_1_1
4   992                     SC_M_O_1_1

Rows in this df that share the same Entry need to be combined. Within each such group, only the first row has an Address. I need to concatenate the ShortOrdDesc values and keep the Address. I found a very useful link on this:

Pandas groupby: How to get a union of strings

Working from this, I have developed the following function:

def f(x):
     return pd.Series(dict(A = x['Entry'].sum(), 
                        B = x['Address'].sum(), 
                        C = "%s" % '; '.join(x['ShortOrdDesc'])))

which is applied using:

myobj = ordersToprint.groupby('Entry').apply(f)

This returns the error:

TypeError: must be str, not int

Looking at my data I don't see what the issue is; I believe running .sum() on the integers in 'Entry' should work.

What is the error in my code or my approach?

Answer 1

I think one of the columns being joined or summed contains non-string values (numbers or NaN), while the string operations need every value to be a str.
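One way to check that suspicion is to count the Python types actually stored in each text column. The sketch below rebuilds the frame from the question by hand, assuming the blank Address cells are NaN (the question does not say how they are stored); `type_counts` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the frame from the question.
# Assumption: the blank Address cells are NaN (a float), not empty strings.
ordersToprint = pd.DataFrame({
    "Entry": [988, 989, 992, 992, 992],
    "Address": ["Fake Address 1", "Fake Address 2", "Fake Address 3",
                np.nan, np.nan],
    "ShortOrdDesc": ["SC_M_W_3_1", "SC_M_W_3_3", "nan_2",
                     "SC_M_G_1_1", "SC_M_O_1_1"],
})

# For each text column, count how many values of each Python type it holds.
type_counts = {
    col: ordersToprint[col].map(lambda v: type(v).__name__)
                           .value_counts()
                           .to_dict()
    for col in ["Address", "ShortOrdDesc"]
}
print(type_counts)
```

Any type other than `str` in a column that is later passed to `str.join` or summed as text is a likely source of the TypeError.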

So use astype(str), and if you need to remove NaNs, add dropna:

def f(x):
 return pd.Series(dict(A = x['Entry'].sum(), 
                    B = ''.join(x['Address'].dropna().astype(str)), 
                    C = '; '.join(x['ShortOrdDesc'].astype(str))))

myobj = ordersToprint.groupby('Entry').apply(f)
print (myobj)
          A               B                              C
Entry                                                     
988     988  Fake Address 1                     SC_M_W_3_1
989     989  Fake Address 2                     SC_M_W_3_3
992    2976  Fake Address 3  nan_2; SC_M_G_1_1; SC_M_O_1_1

Another solution uses agg, but then the columns have to be renamed:

f = {'Entry':'sum', 
      'Address' : lambda x: ''.join(x.dropna().astype(str)), 
      'ShortOrdDesc' : lambda x: '; '.join(x.astype(str))}
cols = {'Entry':'A','Address':'B','ShortOrdDesc':'C'}
myobj = ordersToprint.groupby('Entry').agg(f).rename(columns=cols)[['A','B','C']]
print (myobj)
          A               B                              C
Entry                                                     
988     988  Fake Address 1                     SC_M_W_3_1
989     989  Fake Address 2                     SC_M_W_3_3
992    2976  Fake Address 3  nan_2; SC_M_G_1_1; SC_M_O_1_1
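On pandas 0.25 or later, named aggregation can produce the output column names directly, so the rename step is not needed. The following is a minimal sketch rather than part of the original answer: the frame is rebuilt by hand with blank Address cells assumed to be NaN, `join_strings` is a hypothetical helper, and the summed Entry column (A) is left out because Entry is already the group index.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the frame from the question.
# Assumption: the blank Address cells are NaN.
ordersToprint = pd.DataFrame({
    "Entry": [988, 989, 992, 992, 992],
    "Address": ["Fake Address 1", "Fake Address 2", "Fake Address 3",
                np.nan, np.nan],
    "ShortOrdDesc": ["SC_M_W_3_1", "SC_M_W_3_3", "nan_2",
                     "SC_M_G_1_1", "SC_M_O_1_1"],
})


def join_strings(values, sep):
    """Drop missing values, cast the rest to str, and join them with sep."""
    # Unlike the answer above, real NaNs in ShortOrdDesc are dropped here
    # instead of being kept as the text 'nan'.
    return sep.join(values.dropna().astype(str))


# Named aggregation: output_name=(input_column, aggregation_function).
combined = ordersToprint.groupby("Entry").agg(
    B=("Address", lambda s: join_strings(s, "")),
    C=("ShortOrdDesc", lambda s: join_strings(s, "; ")),
)
print(combined)
```

Note that 'nan_2' is an ordinary string, so dropna leaves it in place; only genuinely missing values are removed before joining.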

Answer 1 (accepted, 2017-10-29 08:19:26)
