Select rows in a pandas dataframe that share two unique identifiers and store them as new dataframes

Question

I have an extremely large, unsorted pandas dataframe (over two million rows) with multiple columns, two of which identify the category each row belongs to. Each combination of "K" and "U" represents a unique category. I want to select all the rows that fall into each of these categories and store them as separate dataframes that can be manipulated and analyzed later for machine learning models. Let me explain:

'a' 'b' 'c' 'K' 'U' 'd'
------------------------
aaa bbb ccc 2245 23 ddd
avd bad cec 2245 23 dwq
avd bad cec 2646 23 dwq
avd bad cec 1621 23 dwq
avd bad cec 1621 26 dwq

The top two rows have the same "K" and "U" values, so I want them stored together. The other rows all belong to different categories because each has a different combination of K and U, so these will be stored separately.

My first "solution" used a for loop to iterate over the dataframe's K values, building a new dataframe from every row with that K, and then a nested for loop over every U in this new K dataframe. Inside that inner loop I created a second dataframe containing every row with the current U. This approach does not work as intended, though I feel I was close to a solution. It is also unbearably slow on the full dataframe. How would I go about doing this in a proper, more efficient manner?

Answer 1

You can do it this way, starting from this dataframe:

         c     K   U    d
     0  aaa  2245  23  ddd
     1  avd  2245  23  dwq
     2  avd  2646  23  dwq
     3  avd  1621  23  dwq
     4  avd  1621  26  dwq 

# Group the rows by their (K, U) pair; each group is its own dataframe.
grouped_df = dataframe.groupby(['K', 'U'])
for key, df in grouped_df:
    print('\n', key, '\n', df.head())

(1621, 23) 
      c     K   U    d
  3  avd  1621  23  dwq

(1621, 26) 
      c     K   U    d
  4  avd  1621  26  dwq

(2245, 23) 
      c     K   U    d
  0  aaa  2245  23  ddd
  1  avd  2245  23  dwq

(2646, 23) 
      c     K   U    d
  2  avd  2646  23  dwq

This way you get n different dataframes, each holding the rows that share one pair of 'K' and 'U' values. After grouping, you can access a single dataframe with the get_group method by providing its key:

df_n=grouped_df.get_group((2245, 23))

(2245, 23) 
      c     K   U    d
  0  aaa  2245  23  ddd
  1  avd  2245  23  dwq
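
If the goal is to keep every group around for later use, one option is to collect the groups into a dictionary keyed by the (K, U) pair. The following is a minimal sketch, not part of the original answer: the sample values are copied from the question's table, and the names `df`, `frames` and `same_category` are chosen here for illustration.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "a": ["aaa", "avd", "avd", "avd", "avd"],
    "b": ["bbb", "bad", "bad", "bad", "bad"],
    "c": ["ccc", "cec", "cec", "cec", "cec"],
    "K": [2245, 2245, 2646, 1621, 1621],
    "U": [23, 23, 23, 23, 26],
    "d": ["ddd", "dwq", "dwq", "dwq", "dwq"],
})

# Map each (K, U) pair to the dataframe holding its rows.
# sort=False keeps the groups in order of first appearance
# instead of sorting the keys.
frames = {key: group for key, group in df.groupby(["K", "U"], sort=False)}

# Look up one category by its key pair.
same_category = frames[(2245, 23)]
print(same_category)
```

The grouping replaces the nested loops over K and U described in the question, so the row-to-category assignment is done by pandas rather than by repeated filtering in Python.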

Answer 2

You can use duplicated, specify a subset of the key columns, and pass keep=False. Then put all of this inside df[] to filter for those rows:

df[df.duplicated(subset=['K', 'U'], keep=False)]

    a   b   c   K       U   d
0   aaa bbb ccc 2245    23  ddd
1   avd bad cec 2245    23  dwq

For the other dataframe, just add a ~ in front:

df[~df.duplicated(subset=['K', 'U'], keep=False)]

    a   b   c   K       U   d
2   avd bad cec 2646    23  dwq
3   avd bad cec 1621    23  dwq
4   avd bad cec 1621    26  dwq
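
Note that this approach produces two dataframes: the rows whose (K, U) pair occurs more than once, and the rows whose pair occurs exactly once. The second frame can still mix several categories, so it may not match the question's goal of one dataframe per category. A small sketch on the question's sample data, with variable names chosen here for illustration, makes the difference visible:

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "a": ["aaa", "avd", "avd", "avd", "avd"],
    "b": ["bbb", "bad", "bad", "bad", "bad"],
    "c": ["ccc", "cec", "cec", "cec", "cec"],
    "K": [2245, 2245, 2646, 1621, 1621],
    "U": [23, 23, 23, 23, 26],
    "d": ["ddd", "dwq", "dwq", "dwq", "dwq"],
})

# True for every row whose (K, U) pair appears more than once.
repeated = df.duplicated(subset=["K", "U"], keep=False)

shared = df[repeated]       # rows whose pair is repeated
singletons = df[~repeated]  # rows whose pair occurs only once

# Count the distinct (K, U) categories for comparison.
n_categories = df.groupby(["K", "U"]).ngroups

print(len(shared), len(singletons), n_categories)
```

On this sample the singleton frame holds three rows from three different categories, while groupby reports four categories in total.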

Answer 1
0 Accepted 2020-12-06 10:36:11

Answer 2
0 2020-12-06 10:38:41
