简体   繁体   中英

How to split DataFrame columns into multiple rows?

I am trying to convert multiple columns to multiple rows. Can someone please offer some advice?

I have DataFrame:

id .        values
1,2,3,4     [('a','b'), ('as','bd'),'|',('ss','dd'), ('ws','ee'),'|',('rr','rt'), ('tt','yy'),'|',('yu','uu'), ('ii','oo')]

I need it to look like this:

ID       Values
1         ('a','b'), ('as','bd')
2         ('ss','dd'), ('ws','ee')
3         ('rr','rt'), ('tt','yy')
4         ('yu','uu'), ('ii','oo')

I have tried groupby , split , izip . Maybe I am not doing it the right way?

I made a quick and dirty example how you could parse this dataframe

# example dataframe
df = [
    "1,2,3,4",
    [('a','b'), ('as','bd'), '|', ('ss','dd'), ('ws','ee'), '|', ('rr','rt'), ('tt','yy'), '|', ('yu','uu'), ('ii','oo')]
]

# split ids by comma
ids = df[0].split(",")

# init Id and Items as int and dict()
Id = 0
Items = dict()

# prepare array for data insert
for i in ids:
    Items[i] = []

# insert data
for i in df[1]:
    if isinstance(i, (tuple)):
        Items[ids[Id]].append(i)
    elif isinstance(i, (str)):
        Id += 1

# print data as written in stackoverflow question
print("id .\tvalues")
for item in Items:
    print("{}\t{}".format(item, Items[item]))

I came up with a quite concise solution, based on multi-level grouping, which in my opinion is to a great extent pandasonic .

Start from defining the following function, "splitting" a Series taken from individual values element into a sequence of lists representations, without surrounding [ and ] . The splitting occurs at each '|' element.:

def fn(grp1):
    grp2 = (grp1 == '|').cumsum()
    return grp1[grp1 != '|'].groupby(grp2).apply(lambda x: repr(list(x))[1:-1])

(will be used a bit later).

The first step of processing is to convert id column into a Series :

sId = df.id.apply(lambda x: pd.Series(x.split(','))).stack().rename('ID')

For your data the result is:

0  0    1
   1    2
   2    3
   3    4
Name: ID, dtype: object

The first level of MultiIndex is the index of the source row and the second level are consecutive numbers (within the current row).

Now it's time to perform similar conversion of values column:

sVal = pd.DataFrame(df['values'].values.tolist(), index= df.index)\
    .stack().groupby(level=0).apply(fn).rename('Values')

The result is:

0  0      ('a', 'b'), ('as', 'bd')
   1    ('ss', 'dd'), ('ws', 'ee')
   2    ('rr', 'rt'), ('tt', 'yy')
   3    ('yu', 'uu'), ('ii', 'oo')
Name: Values, dtype: object

Note that the MultiIndex above has the same structure as in the case of sId .

And the last step is to concat both these partial results:

result = pd.concat([sId, sVal], axis=1).reset_index(drop=True)

The result is:

  ID                      Values
0  1    ('a', 'b'), ('as', 'bd')
1  2  ('ss', 'dd'), ('ws', 'ee')
2  3  ('rr', 'rt'), ('tt', 'yy')
3  4  ('yu', 'uu'), ('ii', 'oo')

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM