Split dataframe column containing iterable

Question

I have a DataFrame in which one column contains sequential data as a list or tuple (always the same length). I want to split this column into several new columns, ideally overwriting one of the existing columns.

Here is a minimal example:

from pandas import DataFrame, concat

data = DataFrame({"label": [a for a in "abcde"], "x": range(5)})
print(data)

  label  x
0     a  0
1     b  1
2     c  2
3     d  3
4     e  4

A fictional way, using a nonexistent function splittuple, would be something like this:

data[["x", "x2"]] = data["x"].apply(lambda x: (x, x*2)).splittuple(expand = True)

resulting in

  label  x  x2
0     a  0   0
1     b  1   2
2     c  2   4
3     d  3   6
4     e  4   8

Of course I can do it like this, though the solution is a bit clunky:

newdata = DataFrame(data["x"].apply(lambda x: (x, x*2)).tolist(), columns = ["x", "x2"])
data.drop("x", axis = 1, inplace = True)
data = concat((data, newdata), axis = 1)
print(data)

  label  x  x2
0     a  0   0
1     b  1   2
2     c  2   4
3     d  3   6
4     e  4   8

An alternative, even uglier solution:

# Parentheses let the assignment continue onto the next line
data[["x", "x2"]] = (
    data["x"].apply(lambda x: "{} {}".format(x, x*2)).str.split(expand=True).astype(int)
)

Could you suggest a more elegant way to do this type of transformation?

Answer 1

It is possible with apply and pd.Series, but it is not very fast:

import pandas as pd

tup = data["x"].apply(lambda x: (x, x*2))
data[["x", "x2"]] = tup.apply(pd.Series)

print(data)
  label  x  x2
0     a  0   0
1     b  1   2
2     c  2   4
3     d  3   6
4     e  4   8

Using the DataFrame constructor is faster:

tup = data["x"].apply(lambda x: (x, x*2))
data[["x", "x2"]] = pd.DataFrame(tup.values.tolist())
print(data)
  label  x  x2
0     a  0   0
1     b  1   2
2     c  2   4
3     d  3   6
4     e  4   8
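
One caveat with the constructor approach: the new frame gets a fresh 0..n-1 index, and the assignment aligns on index labels. That is fine for the example above, but if data has any other index the values may not line up. A minimal sketch of a variant that reuses the original index is below; the string index "vwxyz" and the name expanded are assumptions made only for illustration, they are not part of the question.

```python
import pandas as pd

# Same toy data as in the question, but with a non-default index
# (an assumption, chosen here to show why index= is passed below).
data = pd.DataFrame({"label": list("abcde"), "x": range(5)}, index=list("vwxyz"))

# Series of tuples, exactly as in the answer above.
tup = data["x"].apply(lambda x: (x, x * 2))

# Build the new columns on the same index as `data`,
# so the assignment lines up row by row.
expanded = pd.DataFrame(tup.tolist(), index=data.index, columns=["x", "x2"])
data[["x", "x2"]] = expanded

print(data)
```

With the default index from the question this behaves the same as the shorter version; passing index= only matters once the frame has been filtered, sorted, or otherwise re-indexed.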

Timings:

data = pd.DataFrame({"label": [a for a in "abcde"], "x": range(5)})
data = pd.concat([data]*1000).reset_index(drop=True)
tup = data["x"].apply(lambda x: (x, x*2))


data[["x", "x2"]] = tup.apply(pd.Series)
data[["y", "y2"]] = pd.DataFrame(tup.values.tolist())
print (data)

In [266]: %timeit data[["x", "x2"]] = tup.apply(pd.Series)
1 loop, best of 3: 836 ms per loop

In [267]: %timeit data[["y", "y2"]] = pd.DataFrame(tup.values.tolist())
100 loops, best of 3: 3.1 ms per loop
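
If you want to reproduce the comparison outside IPython, here is a rough sketch using the standard timeit module. The helper names via_apply_series and via_constructor are made up for this example, and the absolute numbers will differ by machine and pandas version, so treat the output as indicative only.

```python
import timeit

import pandas as pd

# Same 5000-row setup as in the timings above.
data = pd.DataFrame({"label": list("abcde"), "x": range(5)})
data = pd.concat([data] * 1000).reset_index(drop=True)
tup = data["x"].apply(lambda x: (x, x * 2))


def via_apply_series():
    # Builds one Series per row, then combines them into a DataFrame.
    return tup.apply(pd.Series)


def via_constructor():
    # Hands the whole list of tuples to the DataFrame constructor at once.
    return pd.DataFrame(tup.tolist(), index=tup.index)


# Sanity check: both approaches should yield the same values.
assert (via_apply_series().to_numpy() == via_constructor().to_numpy()).all()

runs = 3
slow = timeit.timeit(via_apply_series, number=runs) / runs
fast = timeit.timeit(via_constructor, number=runs) / runs
print(f"apply(pd.Series):      {slow * 1e3:.1f} ms per call")
print(f"DataFrame constructor: {fast * 1e3:.1f} ms per call")
```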

solution1
2 ACCEPTED 2018-01-18 15:37:26
