Split a column which is separated by comma based on certain condition in pandas

Question

I have a dataframe

df = pd.DataFrame([["A","a*k,x*k,z,c*m,r,s,f*f,e*d"], ["B","h*t,y,a,w*b,Z,c*b,i*t,f*f"]], columns=["id","c1"])

I want to split column c1, which is comma-separated, according to the following conditions:

Retain only those strings which have * in them.
If the letter after * duplicates the one in an earlier string, skip the later string and take the next one.
Retain only the first 3 strings, split them on *, and make 2 separate columns.

Example: for the 1st row in c1,

z, r, s don't have *, so remove them.
In a*k and x*k, k is duplicated, so retain the 1st one.
So the top 3 will be a*k, c*m, f*f. Split them and make 2 columns: c2 with a,c,f and c3 with k,m,f.

Expected Output:

df_out = pd.DataFrame([["A","a*k,x*k,z,c*m,r,s,f*f,e*d","a,c,f","k,m,f"], ["B","h*t,y,a,w*b,Z,c*b,i*t,f*f","h,w,f","t,b,f"]], columns=["id","c1","c2","c3"])

How to do it?

Answer 1

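You can use pd.Series.str.extractall with GroupBy.apply to drop duplicates and get the first 3 strings. Note that the pattern (.)\*(.) captures exactly one character on each side of the asterisk, which fits the sample data.

# One row per "x*y" match; column 0 is the left character, column 1 the right one.
out = df["c1"].str.extractall(r"(.)\*(.)").groupby(level=0)
# Per source row: drop repeated right-hand characters, keep 3 matches, join with commas.
df[["c2", "c3"]] = out.apply(
    lambda x: x.drop_duplicates(subset=1).head(3).agg(",".join)
)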

# df
  id                         c1     c2     c3
0  A  a*k,x*k,z,c*m,r,s,f*f,e*d  a,c,f  k,m,f
1  B  h*t,y,a,w*b,Z,c*b,i*t,f*f  h,w,f  t,b,f
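
To make the rule itself explicit, here is a minimal sketch of the same logic in plain Python, without pandas. It is not part of the answer above: the function name split_pairs and the limit parameter are hypothetical, and it splits each item on the first asterisk, so unlike the regex it also accepts parts longer than one character.

```python
def split_pairs(cell, limit=3):
    """Return (left_parts, right_parts) as comma-joined strings.

    Hypothetical helper: keeps items containing '*', skips an item whose
    right-hand part was already seen, and stops after `limit` items.
    """
    lefts, rights = [], []
    seen = set()
    for item in cell.split(","):
        if "*" not in item:
            continue  # rule 1: only items with an asterisk
        left, right = item.split("*", 1)
        if right in seen:
            continue  # rule 2: right-hand part already used
        seen.add(right)
        lefts.append(left)
        rights.append(right)
        if len(lefts) == limit:
            break  # rule 3: keep only the first `limit` items
    return ",".join(lefts), ",".join(rights)


row_a = split_pairs("a*k,x*k,z,c*m,r,s,f*f,e*d")
row_b = split_pairs("h*t,y,a,w*b,Z,c*b,i*t,f*f")
print(row_a)  # ('a,c,f', 'k,m,f')
print(row_b)  # ('h,w,f', 't,b,f')
```

Assuming the df from the question, something like df["c1"].apply(split_pairs) would give one tuple per row, which could then be unpacked into c2 and c3.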

Answer 2

First define a function to generate 2 new columns:

def newCols(lst):
    return pd.Series(filter(lambda tt: tt.find('*') >= 0, lst))\
        .str.split('*', expand=True)\
        .rename(columns={0: 'c2', 1: 'c3'})\
        .drop_duplicates(subset='c3').iloc[:3]\
        .apply(lambda col: ','.join(col))

Then generate the result as:

result = df.join(df.c1.str.split(',').apply(newCols))

The result is:

  id                         c1     c2     c3
0  A  a*k,x*k,z,c*m,r,s,f*f,e*d  a,c,f  k,m,f
1  B  h*t,y,a,w*b,Z,c*b,i*t,f*f  h,w,f  t,b,f

Steps of processing in newCols

pd.Series(filter(lambda tt: tt.find('*') >= 0, lst)) - Create a Series from the elements containing an asterisk.
str.split('*', expand=True) - Split each element on the asterisk, giving a DataFrame with columns 0 and 1.
rename(columns={0: 'c2', 1: 'c3'}) - Rename the columns to 'c2' and 'c3'.
drop_duplicates(subset='c3') - Remove duplicate rows (with the same c3), keeping the first.
iloc[:3] - Take only the 3 initial rows.
apply(lambda col: ','.join(col)) - Join each column into a string.

To see what each step does, try executing them incrementally, adding one step at a time, on:

lst = ['a*k', 'x*k', 'z', 'c*m', 'r', 's', 'f*f', 'e*d']

(the list obtained by splitting c1 of the first source row).
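
One way to run the steps one at a time is sketched below, assuming pandas is installed. The names step1 to step6 are hypothetical, and a list comprehension stands in for the filter call. Each line adds one more step from newCols.

```python
import pandas as pd

lst = ['a*k', 'x*k', 'z', 'c*m', 'r', 's', 'f*f', 'e*d']

# Step 1: keep only the elements that contain an asterisk.
step1 = pd.Series([tt for tt in lst if '*' in tt])
# Step 2: split on the literal '*' into a two-column DataFrame (columns 0 and 1).
step2 = step1.str.split('*', expand=True)
# Step 3: give the columns their final names.
step3 = step2.rename(columns={0: 'c2', 1: 'c3'})
# Step 4: drop rows whose c3 value already occurred (the first one is kept).
step4 = step3.drop_duplicates(subset='c3')
# Step 5: keep the first 3 remaining rows.
step5 = step4.iloc[:3]
# Step 6: join each column into one comma-separated string.
step6 = step5.apply(','.join)

for name, value in [('step1', step1), ('step4', step4), ('step6', step6)]:
    print(name, value, sep='\n', end='\n\n')
```

Printing the intermediate values shows that x*k disappears at step 4 and e*d at step 5.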

2 answers

solution1
1 ACCEPTED 2021-04-06 15:46:38

solution2
1 2021-04-06 16:23:13
