
Split dataframe by certain condition but keep the original dataframe

I have a dataframe "bb" like this:

Response                                Unique Count
I love it so much!                      246_0    1
This is not bad, but can be better.     246_1    2
Well done, let's do it.                 247_0    1

If Count is larger than 1, I would like to split the string on the comma so that the dataframe "bb" becomes this (the result I expect):

Response                                Unique
I love it so much!                      246_0    
This is not bad                         246_1_0    
but can be better.                      246_1_1
Well done, let's do it.                 247_0

My code:

bb = DataFrame(bb[bb['Count'] > 1].Response.str.split(',').tolist(), index=bb[bb['Count'] > 1].Unique).stack()
bb = bb.reset_index()[[0, 'Unique']]
bb.columns = ['Response','Unique']
bb=bb.replace('', np.nan)
bb=bb.dropna()
print(bb)

But the result is like this:

           Response  Unique
0  This is not bad    246_1
1  but can be better. 246_1

How can I keep the rows of the original dataframe in this case?

First split only the values that meet the condition into a new helper Series, then use GroupBy.cumcount to add counter values, but only for index values that Index.duplicated flags as duplicated (df here is the question's bb):

s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])

mask = df1.index.duplicated(keep=False)
df1.loc[mask, 'Unique'] += df1[mask].groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print(df1)
              Response   Unique
0   I love it so much!    246_0
1      This is not bad  246_1_0
2   but can be better.  246_1_1
3           Well done!    247_0
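
To see what the mask and the counter each contribute, here is a minimal sketch on a toy frame whose index already repeats label 1, as it does after the join above. The toy frame and the use of np.where in place of the in-place += are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Toy frame (assumed for illustration): label 1 repeats, as after the join.
df1 = pd.DataFrame(
    {"Unique": ["246_0", "246_1", "246_1", "247_0"]},
    index=[0, 1, 1, 2],
)

# True for every row whose index label occurs more than once.
mask = df1.index.duplicated(keep=False)

# Running counter 0, 1, ... within each index label.
counter = df1.groupby(level=0).cumcount()

# Append "_<counter>" only where the label is duplicated.
suffixed = df1["Unique"] + "_" + counter.astype(str)
df1["Unique"] = np.where(mask, suffixed, df1["Unique"])

print(mask.tolist())           # [False, True, True, False]
print(counter.tolist())        # [0, 0, 1, 0]
print(df1["Unique"].tolist())  # ['246_0', '246_1_0', '246_1_1', '247_0']
```

Without the mask, every row would receive a suffix, which is the variant described in the EDIT.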

EDIT: If you need the _0 suffix on all the other values as well, remove the mask:

s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])

df1['Unique'] += df1.groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print(df1)
              Response   Unique
0   I love it so much!  246_0_0
1      This is not bad  246_1_0
2   but can be better.  246_1_1
3           Well done!  247_0_0

We can solve this problem step by step as follows:

  1. Split the dataframe into two by Count.
  2. Use the explode_str function (defined at the end of this answer) to explode the string into rows.
  3. Group by the index and use cumcount to build the correct Unique column values.
  4. Finally, concat the two dataframes back together.

df1 = df[df['Count'].ge(2)] # all rows which have a count 2 or higher
df2 = df[df['Count'].eq(1)] # all rows which have count 1

df1 = explode_str(df1, 'Response', ',') # explode the string to rows on comma delimiter

# Create the correct unique column
df1['Unique'] = df1['Unique'] + '_' + df1.groupby(df1.index).cumcount().astype(str)

df = pd.concat([df1, df2]).sort_index().drop('Count', axis=1).reset_index(drop=True)
              Response   Unique
0   I love it so much!    246_0
1      This is not bad  246_1_0
2   but can be better.  246_1_1
3           Well done!    247_0

Function used, taken from the linked answer:

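import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    # repeat each row position once per piece (number of separators + 1)
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    # re-join all strings, split them again, and assign one piece per repeated row
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})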
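
For reference, the steps above can be run end to end on the question's data roughly as follows. This is a sketch under a few assumptions: the sample frame is rebuilt by hand from the question, the names to_split, untouched, exploded and result are made up for illustration, the whitespace left after each comma is stripped, and a stable sort is requested so the pieces keep their order.

```python
import numpy as np
import pandas as pd


def explode_str(df, col, sep):
    # Same helper as above: repeat each row once per piece, then assign the pieces.
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})


# Sample data copied from the question.
df = pd.DataFrame({
    "Response": [
        "I love it so much!",
        "This is not bad, but can be better.",
        "Well done, let's do it.",
    ],
    "Unique": ["246_0", "246_1", "247_0"],
    "Count": [1, 2, 1],
})

to_split = df[df["Count"].ge(2)]   # rows whose Response is split
untouched = df[df["Count"].eq(1)]  # rows kept exactly as they are

exploded = explode_str(to_split, "Response", ",")
# Assumption: the space that follows each comma is not wanted.
exploded["Response"] = exploded["Response"].str.strip()
exploded["Unique"] = (
    exploded["Unique"] + "_" + exploded.groupby(level=0).cumcount().astype(str)
)

result = (
    pd.concat([exploded, untouched])
    .sort_index(kind="mergesort")  # stable, so split pieces stay in order
    .drop(columns="Count")
    .reset_index(drop=True)
)
print(result)
```

Because the rows with Count 1 never pass through the split, a comma inside them (as in "Well done, let's do it.") is left alone, and df itself is not modified. On pandas 0.25 or later, the built-in DataFrame.explode could probably stand in for the helper if Response is first turned into lists with str.split.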
