Split dataframe column containing iterable
I have a DataFrame in which one column contains sequential data as a list or tuple (always of the same length). I want to split this column into several new columns, ideally overwriting one of the existing columns.
Here is a minimal example:
from pandas import DataFrame, concat
data = DataFrame({"label": [a for a in "abcde"], "x": range(5)})
print(data)
  label  x
0     a  0
1     b  1
2     c  2
3     d  3
4     e  4
A fictional way, using a nonexistent function splittuple, would be something like this:
data[["x", "x2"]] = data["x"].apply(lambda x: (x, x*2)).splittuple(expand = True)
resulting in:
  label  x  x2
0     a  0   0
1     b  1   2
2     c  2   4
3     d  3   6
4     e  4   8
Of course I can do it like this, though the solution is a bit clunky:
newdata = DataFrame(data["x"].apply(lambda x: (x, x*2)).tolist(), columns = ["x", "x2"])
data.drop("x", axis = 1, inplace = True)
data = concat((data, newdata), axis = 1)
print(data)
  label  x  x2
0     a  0   0
1     b  1   2
2     c  2   4
3     d  3   6
4     e  4   8
An alternative, even uglier solution:
data[["x", "x2"]] = data["x"].apply(lambda x: "{} {}".format(x, x*2)).str.split(expand=True).astype(int)
Could you suggest a more elegant way to do this type of transformation?
It is possible with apply and Series, but it is not very fast:
import pandas as pd
tup = data["x"].apply(lambda x: (x, x*2))
data[["x", "x2"]] = tup.apply(pd.Series)
print (data)
  label  x  x2
0     a  0   0
1     b  1   2
2     c  2   4
3     d  3   6
4     e  4   8
Using the DataFrame constructor is faster:
tup = data["x"].apply(lambda x: (x, x*2))
data[["x", "x2"]] = pd.DataFrame(tup.values.tolist())
print (data)
  label  x  x2
0     a  0   0
1     b  1   2
2     c  2   4
3     d  3   6
4     e  4   8
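A small variation on the constructor approach, as a sketch only: the frame in the question has the default 0..n-1 index, but if yours does not, it may be safer to pass the original index to the constructor so the new columns line up with the existing rows. The index labels "v" to "z" below are made up for illustration and are not part of the question.

```python
import pandas as pd

# Hypothetical frame with a non-default index (assumption, for illustration only).
data = pd.DataFrame({"label": list("abcde"), "x": range(5)}, index=list("vwxyz"))

# Series of (x, 2*x) tuples, as in the answer above.
tup = data["x"].apply(lambda x: (x, x * 2))

# Build the new columns from the list of tuples, reusing the original index
# so that the assignment matches rows by label.
expanded = pd.DataFrame(tup.tolist(), index=data.index, columns=["x", "x2"])
data[["x", "x2"]] = expanded

print(data)
```

Naming the columns in the constructor is optional here; it just makes the intermediate frame easier to inspect.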
Timings:
data = pd.DataFrame({"label": [a for a in "abcde"], "x": range(5)})
data = pd.concat([data]*1000).reset_index(drop=True)
tup = data["x"].apply(lambda x: (x, x*2))
data[["x", "x2"]] = tup.apply(pd.Series)
data[["y", "y2"]] = pd.DataFrame(tup.values.tolist())
print (data)
In [266]: %timeit data[["x", "x2"]] = tup.apply(pd.Series)
1 loop, best of 3: 836 ms per loop
In [267]: %timeit data[["y", "y2"]] = pd.DataFrame(tup.values.tolist())
100 loops, best of 3: 3.1 ms per loop
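If you want to reproduce the comparison outside IPython, something like the following should work with the standard timeit module. This is only a rough sketch: the helper names via_series and via_constructor are mine, the frame is smaller than in the timings above, and the absolute numbers will depend on your machine and pandas version.

```python
import timeit

import pandas as pd

data = pd.DataFrame({"label": list("abcde"), "x": range(5)})
# 1000 rows; deliberately smaller than the frame timed above.
data = pd.concat([data] * 200, ignore_index=True)
tup = data["x"].apply(lambda x: (x, x * 2))


def via_series():
    # Builds one Series per row, which is the slow path.
    return tup.apply(pd.Series)


def via_constructor():
    # Converts the tuples to a list once and builds the frame in one call.
    return pd.DataFrame(tup.values.tolist(), index=tup.index)


# Sanity check: both approaches should yield the same values.
same = (via_series().to_numpy() == via_constructor().to_numpy()).all()

t_series = timeit.timeit(via_series, number=3)
t_constructor = timeit.timeit(via_constructor, number=3)
print(f"apply(pd.Series): {t_series:.4f} s, DataFrame constructor: {t_constructor:.4f} s")
```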