
Selecting rows in a pandas dataframe with two unique identifiers and storing these as new dataframes

I have an extremely large, unsorted pandas dataframe (over two million rows) with multiple columns, two of which identify the category each row belongs to. Each combination of "K" and "U" represents a unique category. I want to select all the rows that fall into each category and store them as separate dataframes that can be manipulated and analyzed later for machine learning models. Let me explain:

'a' 'b' 'c' 'K' 'U' 'd'
------------------------
aaa bbb ccc 2245 23 ddd
avd bad cec 2245 23 dwq
avd bad cec 2646 23 dwq
avd bad cec 1621 23 dwq
avd bad cec 1621 26 dwq

The top two rows have the same "K" and "U" values, so I want them stored together. The other rows each have a different combination of K and U, so they belong to different categories and should be stored in separate dataframes.

My first "solution" used a for loop to iterate over the dataframe's K values, building a new dataframe of every row with the current K. Inside it, a nested for loop went over every U in that K dataframe and built a second dataframe of every row with the current U. This approach does not work as intended, though I feel I was close to a solution. It is also unbearably slow on the full dataframe. How would I go about doing this in a proper, more efficient manner?

You can do it this way:

         c     K   U    d
     0  aaa  2245  23  ddd
     1  avd  2245  23  dwq
     2  avd  2646  23  dwq
     3  avd  1621  23  dwq
     4  avd  1621  26  dwq 

grouped_df = dataframe.groupby(['K','U'])
for key,df in grouped_df:
  print('\n',key,'\n',df.head())

(1621, 23) 
      c     K   U    d
  3  avd  1621  23  dwq

(1621, 26) 
      c     K   U    d
  4  avd  1621  26  dwq

(2245, 23) 
      c     K   U    d
  0  aaa  2245  23  ddd
  1  avd  2245  23  dwq

(2646, 23) 
      c     K   U    d
  2  avd  2646  23  dwq
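For anyone who wants to reproduce the grouping above, here is a minimal, self-contained sketch. The sample values are copied from the question; passing sort=False is my own assumption about what suits a large unsorted frame (it skips sorting the group keys and keeps groups in order of first appearance), not something the answer itself uses.

```python
import pandas as pd

# Sample data copied from the question.
dataframe = pd.DataFrame({
    "a": ["aaa", "avd", "avd", "avd", "avd"],
    "b": ["bbb", "bad", "bad", "bad", "bad"],
    "c": ["ccc", "cec", "cec", "cec", "cec"],
    "K": [2245, 2245, 2646, 1621, 1621],
    "U": [23, 23, 23, 23, 26],
    "d": ["ddd", "dwq", "dwq", "dwq", "dwq"],
})

# sort=False (an optional choice, not part of the answer) leaves the
# groups in order of first appearance instead of sorting the keys.
grouped_df = dataframe.groupby(["K", "U"], sort=False)

# Number of distinct (K, U) categories and the row count of each one.
n_groups = grouped_df.ngroups
group_sizes = grouped_df.size()

for key, group in grouped_df:
    print("\n", key, "\n", group.head())
```

Checking ngroups and size() first is a cheap way to see how many dataframes the split would produce before materializing any of them.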

This gives you n dataframes, one for each distinct ('K', 'U') pair. After grouping, you can access a single dataframe with the get_group method by passing its key:

df_n=grouped_df.get_group((2245, 23))

(2245, 23) 
      c     K   U    d
  0  aaa  2245  23  ddd
  1  avd  2245  23  dwq
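If the goal is to keep every category around as its own dataframe, one possible approach is to collect the groups into a dictionary keyed by the (K, U) tuple. This is only a sketch: the name frames and the reset_index call are my own choices, and on a two-million-row frame holding every group at once roughly duplicates the data in memory.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "a": ["aaa", "avd", "avd", "avd", "avd"],
    "b": ["bbb", "bad", "bad", "bad", "bad"],
    "c": ["ccc", "cec", "cec", "cec", "cec"],
    "K": [2245, 2245, 2646, 1621, 1621],
    "U": [23, 23, 23, 23, 26],
    "d": ["ddd", "dwq", "dwq", "dwq", "dwq"],
})

# A single groupby pass maps each (K, U) tuple to its own DataFrame.
# reset_index(drop=True) is optional; it renumbers each piece from 0.
frames = {
    key: group.reset_index(drop=True)
    for key, group in df.groupby(["K", "U"])
}

# Look up one category by its key, much like get_group.
pair = frames[(2245, 23)]
print(pair)
```

Each value in frames is an ordinary dataframe, so it can be passed on to later analysis independently of the others.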

You can use duplicated, specify a subset of the key columns, and pass keep=False, which marks every row whose ('K', 'U') pair occurs more than once. Then put all of this inside df[] to filter for those rows:

df[df.duplicated(subset=['K', 'U'], keep=False)]

    a   b   c   K       U   d
0   aaa bbb ccc 2245    23  ddd
1   avd bad cec 2245    23  dwq

For the other dataframe, just add a ~ in front:

df[~df.duplicated(subset=['K', 'U'], keep=False)]

    a   b   c   K       U   d
2   avd bad cec 2646    23  dwq
3   avd bad cec 1621    23  dwq
4   avd bad cec 1621    26  dwq
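The two filters above can be run end to end as follows. This is a minimal sketch on the question's sample data; the names mask, shared and unique_pairs are mine. Note that this splits the frame into exactly two pieces (rows whose pair repeats, and rows whose pair occurs once), so it does not by itself produce one dataframe per category.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "a": ["aaa", "avd", "avd", "avd", "avd"],
    "b": ["bbb", "bad", "bad", "bad", "bad"],
    "c": ["ccc", "cec", "cec", "cec", "cec"],
    "K": [2245, 2245, 2646, 1621, 1621],
    "U": [23, 23, 23, 23, 26],
    "d": ["ddd", "dwq", "dwq", "dwq", "dwq"],
})

# True for every row whose (K, U) pair appears more than once.
mask = df.duplicated(subset=["K", "U"], keep=False)

shared = df[mask]         # rows that share their (K, U) pair with another row
unique_pairs = df[~mask]  # rows whose (K, U) pair occurs only once

print(shared)
print(unique_pairs)
```

Computing the mask once and reusing it avoids scanning the key columns twice, which may matter on a frame with millions of rows.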

