
Split each cell in a DataFrame (pandas/Python)

I have a large pandas DataFrame with many rows and columns. Every cell holds a pair of binary values as a string, such as '0|1', '0|0', '1|1' or '1|0'. I would like to either split it into two DataFrames or expand each row into two rows (both forms are useful to me), so that this:

        a   b   c   d
rowa    1|0 0|1 0|1 1|0
rowb    0|1 0|0 0|0 0|1
rowc    0|1 1|0 1|0 0|1

becomes

        a   b   c   d
rowa1   1   0   0   1
rowa2   0   1   1   0
rowb1   0   0   0   0
rowb2   1   0   0   1
rowc1   0   1   1   0
rowc2   1   0   0   1

and/or

    df1:    a   b   c   d
    rowa    1   0   0   1
    rowb    0   0   0   0
    rowc    0   1   1   0


    df2:    a   b   c   d
    rowa    0   1   1   0
    rowb    1   0   0   1
    rowc    1   0   0   1

Currently I'm trying something like the following, but I don't believe it is very efficient. Any guidance would be helpful.

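from collections import defaultdict

# One list per column: values before the pipe (A) and after the pipe (B)
Atmp_dict = defaultdict(list)
Btmp_dict = defaultdict(list)

# df is the original DataFrame of 'x|y' strings
for index, row in df.iterrows():
    for columnname in list(df.columns.values):
        Atmp_dict[columnname].append(row[columnname].split('|')[0])
        Btmp_dict[columnname].append(row[columnname].split('|')[1])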

user2734178 is close, but his or her answer has some issues. Here is a slight variation that works:

import pandas as pd

df1 = pd.DataFrame()
df2 = pd.DataFrame()

# df is your original DataFrame
for col in df.columns:
    df1[col] = df[col].apply(lambda x: x.split('|')[0])
    df2[col] = df[col].apply(lambda x: x.split('|')[1])
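To try the loop above end to end, here is a minimal sketch that rebuilds the question's sample table by hand (the literal values below are copied from the question, the construction itself is an assumption about how df was made) and prints both halves.

```python
import pandas as pd

# Sample data copied from the question; each cell is an 'x|y' string
df = pd.DataFrame(
    {
        "a": ["1|0", "0|1", "0|1"],
        "b": ["0|1", "0|0", "1|0"],
        "c": ["0|1", "0|0", "1|0"],
        "d": ["1|0", "0|1", "0|1"],
    },
    index=["rowa", "rowb", "rowc"],
)

df1 = pd.DataFrame()
df2 = pd.DataFrame()

for col in df.columns:
    # Part before the pipe goes to df1, part after it goes to df2
    df1[col] = df[col].apply(lambda x: x.split("|")[0])
    df2[col] = df[col].apply(lambda x: x.split("|")[1])

print(df1)
print(df2)
```

The values in df1 and df2 are still strings at this point, not integers.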

Here is another option that is slightly more elegant. Replace the loop with:

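for col in df.columns:
    # Raw strings keep the regex escapes intact; expand=False returns a Series
    df1[col] = df[col].str.extract(r"(\d)\|", expand=False)
    df2[col] = df[col].str.extract(r"\|(\d)", expand=False)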
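As a rough illustration of what the two patterns capture, the sketch below runs them on a single hypothetical Series named s. It passes expand=False so that each call returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical one-column example in the same 'x|y' format
s = pd.Series(["1|0", "0|1", "0|0"], index=["rowa", "rowb", "rowc"])

# (\d)\| captures the digit immediately before the pipe
before = s.str.extract(r"(\d)\|", expand=False)

# \|(\d) captures the digit immediately after the pipe
after = s.str.extract(r"\|(\d)", expand=False)

print(before.tolist())
print(after.tolist())
```

Because \d matches exactly one digit, these patterns assume single-digit values on each side of the pipe.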

This is pretty compact, but it seems like there should be an even easier and more compact way.

df1 = df.applymap(lambda x: str(x)[0])
df2 = df.applymap(lambda x: str(x)[2])

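Or loop over the columns as in the other answers; I don't think it matters. Note that because the question specifies binary data, every cell is exactly three characters long, so it is fine (and simpler) to take str[0] and str[2] rather than using split or extract. In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which takes the same function.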
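The fixed-position shortcut only holds while each side of the pipe is a single character. The sketch below uses a hypothetical Series with a two-digit value, which is not the question's data, to show where positional indexing and splitting on the delimiter diverge.

```python
import pandas as pd

# Hypothetical values that are NOT single-character binary flags
s = pd.Series(["10|3", "0|1"])

# Positional indexing takes only the first character of each string
by_position = s.str[0]

# Splitting on the pipe keeps the whole token on each side
parts = s.str.split("|", expand=True)

print(by_position.tolist())
print(parts[0].tolist(), parts[1].tolist())
```

For strictly binary cells the two approaches give the same result, so the shorter indexing form is a reasonable choice there.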

Or you could do this, which seems almost silly, but there's nothing actually wrong with it and it is fairly compact.

df1 = df.stack().str[0].unstack()
df2 = df.stack().str[2].unstack()

stack converts the DataFrame to a Series (indexed by row and column label) so that you can use the str accessor, and unstack converts the result back to a DataFrame.
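To make the intermediate step visible, here is a small sketch on a hypothetical two-by-two frame that prints the stacked Series before turning it back into a DataFrame. Some recent pandas versions may emit a FutureWarning about the stack implementation; the result shown here should be unaffected.

```python
import pandas as pd

# Hypothetical small frame in the same 'x|y' format
df = pd.DataFrame(
    {"a": ["1|0", "0|1"], "b": ["0|1", "0|0"]},
    index=["rowa", "rowb"],
)

stacked = df.stack()       # Series with a (row label, column label) MultiIndex
first = stacked.str[0]     # still a Series: first character of every cell
df1 = first.unstack()      # back to the original row/column layout

print(stacked)
print(df1)
```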

Since it looks like all of your values are strings, you can use the .str accessor to pick out the character on each side of the pipe, like this:

import pandas as pd

df1 = pd.DataFrame()
df2 = pd.DataFrame()

#df is defined as in your first example
for col in df.columns:
    df1[col] = df[col].str[0]
    df2[col] = df[col].str[-1]

You'll then probably want to cast df1 and df2 to integer columns using astype(int).
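None of the answers builds the expanded layout from the question (rowa1, rowa2, ...). One possible way, sketched below under the assumption that row labels are unique strings, is to cast both halves to int, suffix the labels, and interleave them with pd.concat. The names expanded and order are hypothetical.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    {
        "a": ["1|0", "0|1", "0|1"],
        "b": ["0|1", "0|0", "1|0"],
        "c": ["0|1", "0|0", "1|0"],
        "d": ["1|0", "0|1", "0|1"],
    },
    index=["rowa", "rowb", "rowc"],
)

# First and last character of every cell, cast from str to int
df1 = df.apply(lambda col: col.str[0]).astype(int)
df2 = df.apply(lambda col: col.str[-1]).astype(int)

# Suffix the row labels so the two halves can share one index
top = df1.rename(index=lambda r: f"{r}1")
bottom = df2.rename(index=lambda r: f"{r}2")

# Interleave: rowa1, rowa2, rowb1, rowb2, ... in the original row order
order = [f"{r}{i}" for r in df.index for i in (1, 2)]
expanded = pd.concat([top, bottom]).loc[order]

print(expanded)
```

Selecting with an explicit order list keeps the original row order even when the labels would not sort that way alphabetically.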


 