Converting tuples in a row to new columns in a pandas DataFrame
I have a column containing lists of tuples and would like to convert these tuples into new columns. Please see the example below:
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3],
                       b=['a', 'a', 'b'],
                       c=[[('pear', 1), ('apple', 2)], [('pear', 7), ('orange', 1)], [('apple', 9)]]))
df
   a  b                         c
0  1  a   [(pear, 1), (apple, 2)]
1  2  a  [(pear, 7), (orange, 1)]
2  3  b              [(apple, 9)]
and would like to convert it to:
   a  b   fruit  value
0  1  a    pear      1
1  1  a   apple      2
2  2  a    pear      7
3  2  a  orange      1
4  3  b   apple      9
I can do it, but my approach is not efficient, and in my case I have more than 500K rows. Is there a more efficient way of doing it?
Note: I'm using pandas 0.21 and currently cannot upgrade due to my project requirements.
Thanks
All three solutions proposed below work well on pandas >= 0.25. For earlier versions, df.explode is not an option, and pandas < 0.24 has no df.to_numpy, so the only solution that runs on earlier versions is @jezreal's solution.
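For readers stuck on pandas < 0.24, here is a minimal sketch of the same row-flattening idea that avoids both df.explode and df.to_numpy. It assumes DataFrame.values is an acceptable stand-in for to_numpy here, and it uses a plain nested comprehension rather than the exact code of any answer; the helper name flatten_tuples is hypothetical.

```python
import pandas as pd

def flatten_tuples(df):
    """Hypothetical helper: one output row per (fruit, value) tuple in column c."""
    # .values predates to_numpy (added in pandas 0.24); each row is an object array
    rows = [(a, b, fruit, value)
            for a, b, c in df[["a", "b", "c"]].values
            for fruit, value in c]
    return pd.DataFrame(rows, columns=["a", "b", "fruit", "value"])

df = pd.DataFrame(dict(a=[1, 2, 3],
                       b=['a', 'a', 'b'],
                       c=[[('pear', 1), ('apple', 2)], [('pear', 7), ('orange', 1)], [('apple', 9)]]))
result = flatten_tuples(df)
print(result)
```

Because the tuples are unpacked by name, a tuple with more or fewer than two items raises an error instead of silently producing misaligned columns.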
A small benchmark is below (pandas == 0.25); surprisingly, explode is slower:
from itertools import product, chain

def sol_1(df):
    phase1 = (product([a], b, c) for a, b, c in df.to_numpy())
    phase2 = [(a, b, *c) for a, b, c in chain.from_iterable(phase1)]
    return pd.DataFrame(phase2, columns=["a", "b", "fruit", "value"])

def sol_2(df):
    df1 = pd.DataFrame([(k, *x) for k, v in df.c.items() for x in v],
                       columns=['i', 'fruit', 'value'])
    df = df.merge(df1, left_index=True, right_on='i').drop('i', axis=1)
    return df

def sol_3(df):
    df = df.explode('c')
    df[['fruit', 'value']] = pd.DataFrame(df['c'].tolist(), index=df.index)
    del df['c']
    return df

%timeit sol_1(df)
%timeit sol_2(df)
%timeit sol_3(df)
586 µs ± 6.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.8 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.14 ms ± 28.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
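Three rows say little about the 500K-row case from the question, so here is a rough sketch for timing on a larger synthetic frame outside IPython. The generator make_frame, the row count, and the fruit list are assumptions for illustration, and only the merge-based sol_2 is timed; results will differ by machine and pandas version.

```python
import random
import timeit

import pandas as pd

def make_frame(n_rows, seed=0):
    """Hypothetical generator: same layout as the example, with 1-3 tuples per row."""
    rng = random.Random(seed)
    fruits = ["pear", "apple", "orange"]
    c = [[(rng.choice(fruits), rng.randint(1, 9)) for _ in range(rng.randint(1, 3))]
         for _ in range(n_rows)]
    return pd.DataFrame(dict(a=list(range(n_rows)),
                             b=[rng.choice("ab") for _ in range(n_rows)],
                             c=c))

def sol_2(df):
    # merge-based solution from the benchmark above, logic unchanged
    df1 = pd.DataFrame([(k, *x) for k, v in df.c.items() for x in v],
                       columns=['i', 'fruit', 'value'])
    return df.merge(df1, left_index=True, right_on='i').drop('i', axis=1)

big = make_frame(5000)
# best of 3 repeats, 3 calls each, reported as seconds per call
best = min(timeit.repeat(lambda: sol_2(big), number=3, repeat=3)) / 3
print("sol_2: {:.4f} s per call on {} rows".format(best, len(big)))
```

Raising the argument of make_frame toward the real row count gives a better picture of how each approach scales than the three-row example does.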
Give this a go and see if it works on your version:
from itertools import product,chain
#create a cartesian for each row in df
phase1 = (product([a],b,c) for a,b,c in df.to_numpy())
#unpack the third entry per row in the flattened iterable
phase2 = [(a,b,*c) for a, b, c in chain.from_iterable(phase1)]
#create dataframe
result = pd.DataFrame(phase2, columns = ["a","b","fruit","value"])
   a  b   fruit  value
0  1  a    pear      1
1  1  a   apple      2
2  2  a    pear      7
3  2  a  orange      1
4  3  b   apple      9
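One caveat worth checking before using this on real data: itertools.product iterates over every argument, so a string in column b is split into its characters. With the single-character values in the example this is harmless, but a longer string multiplies the rows. A small sketch with a made-up row (the value 'ab' is an assumption, not from the question) shows the difference and one possible guard, wrapping b in a list:

```python
from itertools import product

# hypothetical row whose b value has more than one character
a, b, c = 1, 'ab', [('pear', 1), ('apple', 2)]

# b is iterated character by character: 'a' and 'b' each pair with every tuple
naive = list(product([a], b, c))

# wrapping b in a list keeps the string whole
safe = list(product([a], [b], c))

print(len(naive), len(safe))
```

The same wrapping applies to any other scalar column whose values happen to be iterable.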
The idea is to reshape the values into a new DataFrame with a list comprehension and then use DataFrame.merge:
df1 = pd.DataFrame([(k, *x) for k, v in df.pop('c').items() for x in v],
columns=['i','fruit','value'])
print (df1)
   i   fruit  value
0  0    pear      1
1  0   apple      2
2  1    pear      7
3  1  orange      1
4  2   apple      9
df = df.merge(df1, left_index=True, right_on='i').drop('i', axis=1)
print (df)
   a  b   fruit  value
0  1  a    pear      1
1  1  a   apple      2
2  2  a    pear      7
3  2  a  orange      1
4  3  b   apple      9
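Note that df.pop('c') removes column c from df in place. Where the original frame must stay intact, a non-mutating wrapper along these lines may be preferable. The function name and the col parameter are hypothetical, and the final reset_index is an added assumption about the desired index.

```python
import pandas as pd

def tuples_to_columns(df, col='c'):
    """Hypothetical wrapper around the merge approach that leaves df untouched."""
    # one row per tuple, keyed by the original row label in 'i'
    df1 = pd.DataFrame([(k, *x) for k, v in df[col].items() for x in v],
                       columns=['i', 'fruit', 'value'])
    out = df.merge(df1, left_index=True, right_on='i')
    # drop the helper key and the original list column
    return out.drop(['i', col], axis=1).reset_index(drop=True)

df = pd.DataFrame(dict(a=[1, 2, 3],
                       b=['a', 'a', 'b'],
                       c=[[('pear', 1), ('apple', 2)], [('pear', 7), ('orange', 1)], [('apple', 9)]]))
result = tuples_to_columns(df)
print(result)
```

Since df is only read, the function can be called repeatedly on the same frame, for example inside a timing loop.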
Maybe you can try it like this:
df = pd.DataFrame(dict(a=[1,2,3],
b=['a', 'a', 'b'],
c=[[('pear', 1), ('apple', 2)], [('pear', 7), ('orange', 1)], [('apple', 9)] ]))
df = df.explode('c')
df[['fruit', 'value']] = pd.DataFrame(df['c'].tolist(), index=df.index)
del df['c']
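df.explode repeats the original index labels (0, 0, 1, 1, 2), so the result above does not have the 0..4 index shown in the question. Here is a minimal sketch that resets the index before splitting the tuples, assuming pandas >= 0.25 and that the original labels are not needed:

```python
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3],
                       b=['a', 'a', 'b'],
                       c=[[('pear', 1), ('apple', 2)], [('pear', 7), ('orange', 1)], [('apple', 9)]]))

# explode needs pandas >= 0.25; reset_index gives a fresh 0..n-1 index
exploded = df.explode('c').reset_index(drop=True)

# split each (fruit, value) tuple into two columns on the same fresh index
pairs = pd.DataFrame(exploded['c'].tolist(), columns=['fruit', 'value'])
result = pd.concat([exploded.drop('c', axis=1), pairs], axis=1)
print(result)
```

With a unique index on both pieces, the concat lines rows up by position and label alike, which sidesteps any alignment on duplicate labels.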