How to split the values of a cell into multiple rows in a pandas data frame?
I have the following data frame, which was obtained using this code:
df1=df.groupby('id')['x,y'].apply(lambda x: rdp(x.tolist(), 5.0)).reset_index()
The resulting data frame:
id x,y
0 1 [(0, 0), (1, 2)]
1 2 [(1, 3), (1, 2)]
2 3 [(2, 5), (4, 6)]
Is it possible to get something like this?
id x,y
0 1 (0, 0)
1 1 (1, 2)
2 2 (1, 3)
3 2 (1, 2)
4 3 (2, 5)
5 3 (4, 6)
Here, the list of coordinates in each row of the previous df is split into new rows, each keeping its respective id.
You can use the DataFrame constructor with stack:
df2 = (pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id'])
         .stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='x,y'))
print (df2)
id x,y
0 1 (0, 0)
1 1 (1, 2)
2 2 (1, 3)
3 2 (1, 2)
4 3 (2, 5)
5 3 (4, 6)
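The chain above can be easier to follow when it is broken into steps. The following is a minimal sketch, assuming df1 is rebuilt by hand from the sample shown in the question (the original groupby/rdp step is not reproduced); the names wide and stacked are only illustrative.

```python
import pandas as pd

# Sample frame copied from the question's output (built by hand here).
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'x,y': [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]],
})

# Step 1: one column per list position, with id as the index.
wide = pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id'])

# Step 2: stack() moves the columns into an inner index level,
# so every tuple ends up on its own row.
stacked = wide.stack()

# Step 3: drop the positional level, then turn id back into a column.
df2 = (stacked
       .reset_index(level=1, drop=True)
       .reset_index(name='x,y'))

print(df2)
```

Each intermediate object can be printed to see how the shape changes from one row per id to one row per coordinate.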
A numpy solution: use numpy.repeat with the lengths of the values obtained by str.len; the x,y column is flattened by numpy.ndarray.sum:
df2 = pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()),
'x,y': df1['x,y'].values.sum()})
print (df2)
id x,y
0 1 (0, 0)
1 1 (1, 2)
2 2 (1, 3)
3 2 (1, 2)
4 3 (2, 5)
5 3 (4, 6)
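To make the two pieces of the numpy approach visible, here is a small sketch that computes them separately before building the frame. It assumes the same hand-built df1 as in the question's sample; lengths, ids and flat are illustrative names, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question's output (built by hand here).
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'x,y': [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]],
})

# Number of tuples in each row's list.
lengths = df1['x,y'].str.len()

# Repeat each id once per tuple in its list.
ids = np.repeat(df1['id'].values, lengths)

# Summing an object array of lists adds the lists together,
# which concatenates them into one flat list of tuples.
flat = df1['x,y'].values.sum()

df2 = pd.DataFrame({'id': ids, 'x,y': flat})
print(df2)
```

Because ids is a plain numpy array, the resulting frame gets a fresh 0..n-1 index rather than repeated index labels.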
Timings:
In [54]: %timeit pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id']).stack().reset_index(level=1, drop=True).reset_index(name='x,y')
1000 loops, best of 3: 1.49 ms per loop
In [55]: %timeit pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()), 'x,y': df1['x,y'].values.sum()})
1000 loops, best of 3: 562 µs per loop
#piRSquared solution
In [56]: %timeit pd.DataFrame({'id': df1['id'].repeat(df1['x,y'].str.len()), 'x,y': df1['x,y'].sum() })
1000 loops, best of 3: 712 µs per loop
'id' column
I use the str.len method to quickly count the number of elements in each row's sub-list. This is convenient because we can pass the result directly to the repeat method of df1['id'], which repeats each element according to the corresponding length.
'x,y' column
Normally np.concatenate would push all the sub-lists together. However, in this case the sub-lists are lists of tuples, and np.concatenate will not treat these as lists of objects. So instead I use the sum method, which relies on the underlying addition of lists, which in turn concatenates them.
pandas
If we stick with pandas, we can keep the code cleaner.
Use repeat with str.len and sum:
pd.DataFrame({
'id': df1['id'].repeat(df1['x,y'].str.len()),
'x,y': df1['x,y'].sum()
})
id x,y
0 1 (0, 0)
0 1 (1, 2)
1 2 (1, 3)
1 2 (1, 2)
2 3 (2, 5)
2 3 (4, 6)
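The difference between np.concatenate and sum on lists of tuples can be checked directly. This is a rough sketch on the question's sample data (rebuilt by hand); the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question's output (built by hand here).
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'x,y': [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]],
})
col = df1['x,y']

# np.concatenate converts every sub-list into a 2-D numeric array,
# so the tuples are unpacked into rows of numbers.
joined = np.concatenate(col.tolist())
print(joined.shape)

# sum adds the lists together, which keeps each tuple intact.
flat = col.sum()
print(flat)
```

The first result is a two-column array of numbers, while the second is a flat list of tuple objects that can be used as a single column.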
numpy
We can speed this approach up by using the underlying numpy arrays and the equivalent numpy methods.
NOTE: this is equivalent logic!
pd.DataFrame({
'id': df1['id'].values.repeat(df1['x,y'].str.len()),
'x,y': df1['x,y'].values.sum()
})
We can speed it up even more by skipping the str.len method and calculating the lengths with a list comprehension.
pd.DataFrame({
'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
'x,y': df1['x,y'].values.sum()
})
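As a quick sanity check, the list comprehension and str.len should give the same lengths, so swapping one for the other changes only the speed. A minimal sketch, assuming the hand-built sample df1 from the question:

```python
import pandas as pd

# Sample frame copied from the question's output (built by hand here).
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'x,y': [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]],
})

# Lengths via the pandas string accessor.
lengths_str = df1['x,y'].str.len().tolist()

# Lengths via a plain Python list comprehension.
lengths_lc = [len(w) for w in df1['x,y'].values.tolist()]

# Either list can drive ndarray.repeat.
ids = df1['id'].values.repeat(lengths_lc)
print(lengths_str, lengths_lc, ids)
```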
small data
%%timeit
pd.DataFrame({
'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
'x,y': df1['x,y'].values.sum()
})
1000 loops, best of 3: 351 µs per loop
%%timeit
pd.DataFrame({
'id': df1['id'].repeat(df1['x,y'].str.len()),
'x,y': df1['x,y'].sum()
})
1000 loops, best of 3: 590 µs per loop
%%timeit
pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()),
'x,y': df1['x,y'].values.sum()})
1000 loops, best of 3: 498 µs per loop
larger data
df1 = pd.concat([df1.head(3)] * 100, ignore_index=True)
%%timeit
pd.DataFrame({
'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
'x,y': df1['x,y'].values.sum()
})
1000 loops, best of 3: 579 µs per loop
%%timeit
pd.DataFrame({
'id': df1['id'].repeat(df1['x,y'].str.len()),
'x,y': df1['x,y'].sum()
})
1000 loops, best of 3: 841 µs per loop
%%timeit
pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()),
'x,y': df1['x,y'].values.sum()})
1000 loops, best of 3: 704 µs per loop
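The timings above depend on the machine and the pandas/numpy versions, so it may be worth re-running them locally. Below is a sketch using the standard timeit module outside IPython; the function names are made up for illustration, and the data is the question's sample repeated 100 times as in the larger test.

```python
import timeit

import numpy as np
import pandas as pd

# Sample frame from the question, repeated to get a larger input.
base = pd.DataFrame({
    'id': [1, 2, 3],
    'x,y': [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]],
})
df1 = pd.concat([base] * 100, ignore_index=True)


def with_list_comp(df):
    # numpy arrays plus a list comprehension for the lengths
    lengths = [len(w) for w in df['x,y'].values.tolist()]
    return pd.DataFrame({'id': df['id'].values.repeat(lengths),
                         'x,y': df['x,y'].values.sum()})


def with_series_repeat(df):
    # pandas-only version
    return pd.DataFrame({'id': df['id'].repeat(df['x,y'].str.len()),
                         'x,y': df['x,y'].sum()})


def with_np_repeat(df):
    # np.repeat with str.len
    return pd.DataFrame({'id': np.repeat(df['id'].values, df['x,y'].str.len()),
                         'x,y': df['x,y'].values.sum()})


results = {}
for func in (with_list_comp, with_series_repeat, with_np_repeat):
    # Drop the repeated index labels so the variants are comparable.
    results[func.__name__] = func(df1).reset_index(drop=True)
    best = min(timeit.repeat(lambda: func(df1), number=100, repeat=3))
    print(f"{func.__name__}: {best / 100 * 1e6:.0f} µs per call")
```

The absolute numbers will differ from those quoted above; only the relative ordering on the same machine is meaningful.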