
How to split values of a cell in multiple rows in pandas data frame?

I have the following data frame, which was obtained using this code:

     df1=df.groupby('id')['x,y'].apply(lambda x: rdp(x.tolist(), 5.0)).reset_index()

Refer here.

The resulting data frame:

      id          x,y
  0   1    [(0, 0), (1, 2)]
  1   2    [(1, 3), (1, 2)]
  2   3    [(2, 5), (4, 6)]  

Is it possible to get something like this:

         id      x,y
     0   1      (0, 0)
     1   1      (1, 2)
     2   2      (1, 3)
     3   2      (1, 2)
     4   3      (2, 5)
     5   3      (4, 6)

Here, each list of coordinates in the previous df is split into new rows, one row per coordinate, against its respective id.

You can use the DataFrame constructor with stack:

# the chain is wrapped in parentheses so it can span several lines
df2 = (pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id'])
         .stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='x,y'))
print (df2)

   id     x,y
0   1  (0, 0)
1   1  (1, 2)
2   2  (1, 3)
3   2  (1, 2)
4   3  (2, 5)
5   3  (4, 6)
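To see what each step of the chain contributes, the intermediate results can be inspected one at a time. The following is a minimal sketch that rebuilds the sample df1 by hand from the values shown in the question; the names wide and stacked are illustrative only and do not appear in the original answer.

```python
import pandas as pd

# Sample frame rebuilt by hand from the question's output (assumption:
# the real df1 came from the groupby/rdp call, which is not reproduced here)
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'x,y': [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]],
})

# One column per list position, one row per id
wide = pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id'])

# stack() moves the columns into an inner index level: (id, position)
stacked = wide.stack()

# Drop the position level, then turn the id index back into a column
df2 = (stacked.reset_index(level=1, drop=True)
              .reset_index(name='x,y'))
print(df2)
```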

A numpy solution uses numpy.repeat with the lengths of the values obtained from str.len; the x,y column is flattened with numpy.ndarray.sum:

df2 = pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()), 
                   'x,y': df1['x,y'].values.sum()})

print (df2)
   id     x,y
0   1  (0, 0)
1   1  (1, 2)
2   2  (1, 3)
3   2  (1, 2)
4   3  (2, 5)
5   3  (4, 6)
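The sample data has two coordinates in every row, which hides what the repeat counts are doing. As a rough illustration, the same steps can be run on a hypothetical frame whose sub-lists have different lengths; the frame demo and its values are made up for this sketch and are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with sub-lists of different lengths (not from the question)
demo = pd.DataFrame({
    'id': [10, 20, 30],
    'x,y': [[(0, 0)], [(1, 3), (1, 2), (5, 5)], [(2, 5), (4, 6)]],
})

lengths = demo['x,y'].str.len()              # elements per sub-list: 1, 3, 2
ids = np.repeat(demo['id'].values, lengths)  # each id repeated that many times
flat = demo['x,y'].values.sum()              # lists are joined with +

out = pd.DataFrame({'id': ids, 'x,y': flat})
print(out)
```

Because ids is a plain numpy array, the resulting frame gets a fresh 0..n-1 index.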

Timings:

In [54]: %timeit pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id']).stack().reset_index(level=1, drop=True).reset_index(name='x,y')
1000 loops, best of 3: 1.49 ms per loop

In [55]: %timeit pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()), 'x,y': df1['x,y'].values.sum()})
1000 loops, best of 3: 562 µs per loop

#piRSquared solution
In [56]: %timeit pd.DataFrame({'id': df1['id'].repeat(df1['x,y'].str.len()), 'x,y': df1['x,y'].sum() })
1000 loops, best of 3: 712 µs per loop
  • Calculating the new 'id' column
    • We can use the pandas str.len method to quickly count the number of elements in each sub-list. This is convenient because we can pass the result directly to the repeat method of df1['id'], which repeats each element by the corresponding length.
  • Calculating the new 'x,y' column
    • Typically, I like to use np.concatenate to push all the sub-lists together. However, in this case the sub-lists are lists of tuples, and np.concatenate will not treat them as lists of objects. So instead I use the sum method, which falls back to the underlying addition on lists, which in turn concatenates them.
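The difference between the two flattening approaches can be checked on the sample values. This is a small sketch, assuming the column holds the three sub-lists shown in the question; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

col = pd.Series([[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]])

# np.concatenate converts every sub-list to a 2-D numeric array,
# so the tuples are unpacked into rows instead of staying as objects
as_array = np.concatenate(col.tolist())
print(as_array.shape)            # (6, 2)

# sum on the object array adds the lists (list + list),
# so each tuple survives as a single element
as_list = col.values.sum()
print(len(as_list), as_list[0])  # 6 (0, 0)
```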

pandas

If we stick with pandas, we can keep the code cleaner.
Use repeat with str.len and sum:

pd.DataFrame({
        'id': df1['id'].repeat(df1['x,y'].str.len()),
        'x,y': df1['x,y'].sum()
    })

   id     x,y
0   1  (0, 0)
0   1  (1, 2)
1   2  (1, 3)
1   2  (1, 2)
2   3  (2, 5)
2   3  (4, 6)
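The output above keeps the repeated index 0, 0, 1, 1, 2, 2 because Series.repeat carries the original index labels along. If a fresh index is wanted, as in the question's desired output, one option is to append reset_index(drop=True). A minimal sketch, with df1 rebuilt by hand from the question's values:

```python
import pandas as pd

# Sample frame rebuilt by hand from the question's output
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'x,y': [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]],
})

df2 = pd.DataFrame({
    'id': df1['id'].repeat(df1['x,y'].str.len()),
    'x,y': df1['x,y'].sum(),
})
print(df2.index.tolist())   # [0, 0, 1, 1, 2, 2]

# Discard the repeated labels and number the rows from 0
df2 = df2.reset_index(drop=True)
print(df2.index.tolist())   # [0, 1, 2, 3, 4, 5]
```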

numpy

We can speed this approach up by using the underlying numpy arrays and the equivalent numpy methods.
NOTE: this is equivalent logic!

pd.DataFrame({
        'id': df1['id'].values.repeat(df1['x,y'].str.len()),
        'x,y': df1['x,y'].values.sum()
    })

We can speed it up even more by skipping the str.len method and calculating the lengths with a list comprehension.

pd.DataFrame({
        'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
        'x,y': df1['x,y'].values.sum()
    })
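Summing lists builds a new list at every addition, so the flattening step may scale poorly on very large frames. This is a general property of repeated list concatenation, not something the timings in this thread measure. As a swapped-in alternative that is not part of the original answers, the flattening could be done with itertools.chain.from_iterable from the standard library:

```python
from itertools import chain

import pandas as pd

# Sample frame rebuilt by hand from the question's output
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'x,y': [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]],
})

values = df1['x,y'].values.tolist()

df2 = pd.DataFrame({
    # same list-comprehension lengths as in the snippet above
    'id': df1['id'].values.repeat([len(w) for w in values]),
    # chain walks the sub-lists once instead of adding them pairwise
    'x,y': list(chain.from_iterable(values)),
})
print(df2)
```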

Time Tests

small data

%%timeit
pd.DataFrame({
        'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
        'x,y': df1['x,y'].values.sum()
    })
1000 loops, best of 3: 351 µs per loop

%%timeit
pd.DataFrame({
        'id': df1['id'].repeat(df1['x,y'].str.len()),
        'x,y': df1['x,y'].sum()
    })
1000 loops, best of 3: 590 µs per loop

%%timeit
pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()), 
                   'x,y': df1['x,y'].values.sum()})
1000 loops, best of 3: 498 µs per loop

larger data

df1 = pd.concat([df1.head(3)] * 100, ignore_index=True)

%%timeit
pd.DataFrame({
        'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
        'x,y': df1['x,y'].values.sum()
    })
1000 loops, best of 3: 579 µs per loop

%%timeit
pd.DataFrame({
        'id': df1['id'].repeat(df1['x,y'].str.len()),
        'x,y': df1['x,y'].sum()
    })
1000 loops, best of 3: 841 µs per loop

%%timeit
pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()), 
                   'x,y': df1['x,y'].values.sum()})
1000 loops, best of 3: 704 µs per loop
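For readers on pandas 0.25 or later, DataFrame.explode covers the same use case directly. It is not one of the approaches timed above, so no claim is made here about its relative speed. A minimal sketch, again with df1 rebuilt by hand from the question's values:

```python
import pandas as pd

# Sample frame rebuilt by hand from the question's output
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'x,y': [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]],
})

# explode turns each element of the list into its own row and repeats
# the other columns; only the outer list is split, the tuples are kept
df2 = df1.explode('x,y').reset_index(drop=True)
print(df2)
```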
