How to split values of a cell in multiple rows in pandas data frame?

Question

I have the following data frame, which was obtained using this code:

     df1=df.groupby('id')['x,y'].apply(lambda x: rdp(x.tolist(), 5.0)).reset_index()

The resulting data frame:

      id          x,y
  0   1    [(0, 0), (1, 2)]
  1   2    [(1, 3), (1, 2)]
  2   3    [(2, 5), (4, 6)]

Is it possible to get something like this:

         id      x,y
     0   1      (0, 0)
     1   1      (1, 2)
     2   2      (1, 3)
     3   2      (1, 2)
     4   3      (2, 5)
     5   3      (4, 6)

Here, each list of coordinates in the previous data frame is split into separate rows, and every new row keeps its respective id.

Answer 1

You can use the DataFrame constructor with stack:

import pandas as pd

# The parentheses let the method chain span several lines
df2 = (pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id'])
         .stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='x,y'))
print(df2)

   id     x,y
0   1  (0, 0)
1   1  (1, 2)
2   2  (1, 3)
3   2  (1, 2)
4   3  (2, 5)
5   3  (4, 6)
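
To see what each step of the chain does, here is a minimal self-contained sketch. The sample df1 is rebuilt by hand to mirror the frame in the question (an assumption, since the original came from the rdp step), and the intermediate names wide and stacked are introduced only for illustration.

```python
import pandas as pd

# Hand-built sample mirroring the question's frame (assumed data).
df1 = pd.DataFrame({
    "id": [1, 2, 3],
    "x,y": [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]],
})

# Step 1: one row per id, one column per position inside the list.
wide = pd.DataFrame(df1["x,y"].values.tolist(), index=df1["id"])

# Step 2: stack() moves the columns into an inner index level,
# leaving one (id, position) entry per tuple.
stacked = wide.stack()

# Step 3: drop the position level, then turn id back into a column.
df2 = stacked.reset_index(level=1, drop=True).reset_index(name="x,y")
print(df2)
```

Splitting the chain this way is only for inspection; the one-expression form does the same work.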

The numpy solution uses numpy.repeat with the list lengths from str.len; the x,y column is flattened with numpy.ndarray.sum:

df2 = pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()), 
                   'x,y': df1['x,y'].values.sum()})

print(df2)
   id     x,y
0   1  (0, 0)
1   1  (1, 2)
2   2  (1, 3)
3   2  (1, 2)
4   3  (2, 5)
5   3  (4, 6)

Timings:

In [54]: %timeit pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id']).stack().reset_index(level=1, drop=True).reset_index(name='x,y')
1000 loops, best of 3: 1.49 ms per loop

In [55]: %timeit pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()), 'x,y': df1['x,y'].values.sum()})
1000 loops, best of 3: 562 µs per loop

#piRSquared solution
In [56]: %timeit pd.DataFrame({'id': df1['id'].repeat(df1['x,y'].str.len()), 'x,y': df1['x,y'].sum() })
1000 loops, best of 3: 712 µs per loop

Answer 2

Calculating the new 'id' column
- We can use the pandas str.len method to count the elements in each row's sub-list. Conveniently, we can pass this result directly to the repeat method of df1['id'], which repeats each id as many times as its sub-list has elements.
Calculating the new 'x,y' column
- Typically, I like to use np.concatenate to push all the sub-lists together. However, in this case the sub-lists are lists of tuples, and np.concatenate will not treat the tuples as objects: it unpacks them into the rows of a 2-D array. So instead I use the sum method, which falls back on list addition and therefore concatenates the sub-lists.
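
The two calculations can be inspected one piece at a time. This is a small sketch on a hand-built copy of the question's frame (assumed data); the names lengths, ids, flat and stacked are illustrative only.

```python
import numpy as np
import pandas as pd

# Hand-built sample mirroring the question's frame (assumed data).
df1 = pd.DataFrame({
    "id": [1, 2, 3],
    "x,y": [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]],
})

# Number of tuples in each row's list.
lengths = df1["x,y"].str.len()

# Each id repeated once per tuple; the original row labels are repeated too.
ids = df1["id"].repeat(lengths)

# sum() adds the lists together, i.e. concatenates them into one flat list.
flat = df1["x,y"].sum()

# np.concatenate converts the tuples into array rows instead of keeping them.
stacked = np.concatenate(df1["x,y"].tolist())

print(lengths.tolist())
print(ids.tolist())
print(flat)
print(stacked.shape)
```

The 2-D shape of stacked is the reason np.concatenate is not a drop-in choice here: the tuples would have to be rebuilt afterwards.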

`pandas`

If we stick with pandas, we can keep the code cleaner.
Use repeat with str.len and sum:

pd.DataFrame({
        'id': df1['id'].repeat(df1['x,y'].str.len()),
        'x,y': df1['x,y'].sum()
    })

   id     x,y
0   1  (0, 0)
0   1  (1, 2)
1   2  (1, 3)
1   2  (1, 2)
2   3  (2, 5)
2   3  (4, 6)

`numpy`

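We can speed this approach up by using the underlying numpy arrays and the equivalent numpy methods.
NOTE: this is equivalent logic! The only visible difference is that the result gets a fresh 0..n-1 index instead of the repeated row labels.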

pd.DataFrame({
        'id': df1['id'].values.repeat(df1['x,y'].str.len()),
        'x,y': df1['x,y'].values.sum()
    })

We can speed it up even more by skipping the str.len method and calculating the lengths with a list comprehension.

pd.DataFrame({
        'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
        'x,y': df1['x,y'].values.sum()
    })
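
As a quick sanity check that the three variants agree, the sketch below builds each one on a hand-built copy of the question's frame (assumed data) and compares their contents. The function names and the same_rows helper are hypothetical, introduced only for this comparison.

```python
import pandas as pd

# Hand-built sample mirroring the question's frame (assumed data).
df1 = pd.DataFrame({
    "id": [1, 2, 3],
    "x,y": [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]],
})

def pandas_repeat(df):
    # Pure pandas: Series.repeat keeps the repeated row labels as index.
    return pd.DataFrame({
        "id": df["id"].repeat(df["x,y"].str.len()),
        "x,y": df["x,y"].sum(),
    })

def numpy_repeat(df):
    # Same logic on the underlying numpy arrays.
    return pd.DataFrame({
        "id": df["id"].values.repeat(df["x,y"].str.len()),
        "x,y": df["x,y"].values.sum(),
    })

def list_comp(df):
    # Lengths from a list comprehension instead of str.len.
    lens = [len(w) for w in df["x,y"].values.tolist()]
    return pd.DataFrame({
        "id": df["id"].values.repeat(lens),
        "x,y": df["x,y"].values.sum(),
    })

def same_rows(a, b):
    # Compare cell contents only, ignoring the index labels.
    return (a["id"].tolist() == b["id"].tolist()
            and a["x,y"].tolist() == b["x,y"].tolist())

results = [f(df1) for f in (pandas_repeat, numpy_repeat, list_comp)]
print(all(same_rows(results[0], r) for r in results[1:]))
```

Comparing through tolist() sidesteps the index difference between the pandas variant and the two numpy variants.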

Time Tests

small data

%%timeit
pd.DataFrame({
        'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
        'x,y': df1['x,y'].values.sum()
    })
1000 loops, best of 3: 351 µs per loop

%%timeit
pd.DataFrame({
        'id': df1['id'].repeat(df1['x,y'].str.len()),
        'x,y': df1['x,y'].sum()
    })
1000 loops, best of 3: 590 µs per loop

%%timeit 
pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()), 
                   'x,y': df1['x,y'].values.sum()})

1000 loops, best of 3: 498 µs per loop

larger data

df1 = pd.concat([df1.head(3)] * 100, ignore_index=True)

%%timeit
pd.DataFrame({
        'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
        'x,y': df1['x,y'].values.sum()
    })
1000 loops, best of 3: 579 µs per loop

%%timeit
pd.DataFrame({
        'id': df1['id'].repeat(df1['x,y'].str.len()),
        'x,y': df1['x,y'].sum()
    })
1000 loops, best of 3: 841 µs per loop

%%timeit 
pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()), 
                   'x,y': df1['x,y'].values.sum()})

1000 loops, best of 3: 704 µs per loop
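
The %%timeit cells only run inside IPython or Jupyter. Outside a notebook, a comparable measurement could be sketched with the standard-library timeit module. The sample frame is hand-built (assumed data), the function names are hypothetical, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hand-built sample mirroring the question's frame (assumed data),
# enlarged the same way as in the "larger data" test.
small = pd.DataFrame({
    "id": [1, 2, 3],
    "x,y": [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]],
})
big = pd.concat([small] * 100, ignore_index=True)

def list_comp(df):
    lens = [len(w) for w in df["x,y"].values.tolist()]
    return pd.DataFrame({"id": df["id"].values.repeat(lens),
                         "x,y": df["x,y"].values.sum()})

def pandas_repeat(df):
    return pd.DataFrame({"id": df["id"].repeat(df["x,y"].str.len()),
                         "x,y": df["x,y"].sum()})

def numpy_repeat(df):
    return pd.DataFrame({"id": np.repeat(df["id"].values, df["x,y"].str.len()),
                         "x,y": df["x,y"].values.sum()})

loops = 100
timings = {}
for fn in (list_comp, pandas_repeat, numpy_repeat):
    # Best of 3 runs, reported per single call, like %timeit does.
    best = min(timeit.repeat(lambda: fn(big), number=loops, repeat=3))
    timings[fn.__name__] = best / loops
    print(f"{fn.__name__}: {timings[fn.__name__] * 1e6:.0f} µs per loop")
```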

2 answers

solution1
5 ACCEPTED 2017-05-02 06:11:25

solution2
2 2017-05-02 06:24:25
