
Split values in a column and create a matrix of column names

I am looking for a solution to my problem that takes minimal effort.

Question:

I have a list of delimited strings. I would like to split each string and place each value in the appropriate cell. The column headings should also be populated.

Input

A,B,C
C,D,A,E
D,E

Output

+-------+-------+-------+-------+-------+
| VLUE1 | VLUE2 | VLUE3 | VLUE4 | VLUE5 |
+-------+-------+-------+-------+-------+
| A     | B     | C     |       |       |
| A     |       | C     | D     | E     |
|       |       |       | D     | E     |
+-------+-------+-------+-------+-------+

I have a solution that uses sorting, key-value pairs, and iteration in plain Python, but I would like to know whether there is a shortcut using a Python package such as pandas.

-Sam

Starting with a series -

s

0      A,B,C
1    C,D,A,E
2        D,E
dtype: object

Convert s to a one-hot encoded (OHE) matrix using get_dummies -

x = s.str.get_dummies(sep=',')
x

   A  B  C  D  E
0  1  1  1  0  0
1  1  0  1  1  1
2  0  0  0  1  1

Use this to create a new dataframe by multiplying each indicator column by its column name (1 keeps the label, 0 gives an empty string) -

import numpy as np
import pandas as pd

v = x.mul(x.columns).values       # 1 * 'A' -> 'A', 0 * 'A' -> ''
c = np.arange(1, x.shape[1] + 1)  # column numbers 1..n

df = pd.DataFrame(v, columns=c).add_prefix('VLUE')
df

  VLUE1 VLUE2 VLUE3 VLUE4 VLUE5
0     A     B     C            
1     A           C     D     E
2                       D     E
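
For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. It builds the example series itself and places the labels with np.where rather than x.mul(x.columns), so it does not rely on integer-times-string multiplication; that swap is my own substitution, not part of the answer above. The helper name split_to_matrix and its parameters are made up for illustration.

```python
import numpy as np
import pandas as pd


def split_to_matrix(s, sep=",", prefix="VLUE"):
    """Hypothetical helper: one column per distinct value, blank where absent."""
    # One-hot encode: one indicator column per distinct value, sorted by label.
    x = s.str.get_dummies(sep=sep)
    # Keep the column label where the indicator is set, empty string elsewhere.
    v = np.where(x.to_numpy().astype(bool), x.columns.to_numpy(), "")
    # Name the columns prefix1, prefix2, ...
    names = [f"{prefix}{i}" for i in range(1, x.shape[1] + 1)]
    return pd.DataFrame(v, index=s.index, columns=names)


s = pd.Series(["A,B,C", "C,D,A,E", "D,E"])
df = split_to_matrix(s)
print(df)
```

Passing a different sep or prefix should cover other delimiters and heading styles, under the same assumption that every cell is a plain delimited string.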

As far as I know, get_dummies is the fastest. Here is my attempt using value_counts and masking (df[0] is the column holding the delimited strings):

mask = df[0].str.split(',', expand=True).apply(pd.Series.value_counts, axis=1).notna()

pd.DataFrame(np.where(mask,mask.columns,'')).add_prefix('VALU')


  VALU0 VALU1 VALU2 VALU3 VALU4
0     A     B     C            
1     A           C     D     E
2                       D     E
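
The masking approach can be tried as a standalone script along these lines. This is only a sketch: it assumes the raw strings sit in column 0 of a DataFrame named df, and the extra sort_index call is my addition, so that the column layout does not depend on the order in which values first appear.

```python
import numpy as np
import pandas as pd

# Assumption: the raw strings live in column 0 of a DataFrame named df.
df = pd.DataFrame({0: ["A,B,C", "C,D,A,E", "D,E"]})

# Split into one token per cell; shorter rows are padded with missing values.
parts = df[0].str.split(",", expand=True)
# Count tokens per row; a non-missing count means the value occurs in that row.
mask = parts.apply(pd.Series.value_counts, axis=1).notna()
# Sort the columns alphabetically (added step, not in the answer above).
mask = mask.sort_index(axis=1)

# Put the column label where the mask is True, an empty string elsewhere.
out = pd.DataFrame(np.where(mask, mask.columns, "")).add_prefix("VALU")
print(out)
```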
