Selecting rows in a pandas dataframe with two unique identifiers and storing these as new dataframes
I have an extremely large, unsorted pandas dataframe (over two million rows) with multiple columns, two of which identify the category each row belongs to. The combination of "K" and "U" represents a unique category for these rows. I want to select all the rows that fall into each category and store them as separate dataframes that can be manipulated and analyzed later for machine learning models. Let me explain:
'a' 'b' 'c' 'K' 'U' 'd'
------------------------
aaa bbb ccc 2245 23 ddd
avd bad cec 2245 23 dwq
avd bad cec 2646 23 dwq
avd bad cec 1621 23 dwq
avd bad cec 1621 26 dwq
The two uppermost rows have the same "K" and "U" values, so I want these to be stored together. The other rows all belong to different categories because each has a different combination of K and U, so these will be stored in a separate dataframe.
My first "solution" used a for loop to iterate through the dataframe's K values, making a new dataframe containing every row with a given unique K, and then a nested for loop over every U in this new K dataframe. Inside that inner loop I created a second dataframe containing every row with the current U. This approach does not work as intended, but I feel I was close to a solution. It is also unbearably slow on the full dataframe, and a quicker, proper solution would be appreciated. How would I go about doing this in a proper, more efficient manner?
You can do it this way:
c K U d
0 aaa 2245 23 ddd
1 avd 2245 23 dwq
2 avd 2646 23 dwq
3 avd 1621 23 dwq
4 avd 1621 26 dwq
grouped_df = dataframe.groupby(['K','U'])
for key, df in grouped_df:
    print('\n', key, '\n', df.head())
(1621, 23)
c K U d
3 avd 1621 23 dwq
(1621, 26)
c K U d
4 avd 1621 26 dwq
(2245, 23)
c K U d
0 aaa 2245 23 ddd
1 avd 2245 23 dwq
(2646, 23)
c K U d
2 avd 2646 23 dwq
In this way you have n different dataframes, each holding the rows that share the same pair of 'K' and 'U' values. After grouping, you can access a single dataframe with the get_group method by providing its key, like:
df_n=grouped_df.get_group((2245, 23))
(2245, 23)
c K U d
0 aaa 2245 23 ddd
1 avd 2245 23 dwq
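If every category needs to be kept around for later use, one option is to collect the groups into a dictionary keyed by the (K, U) pair. The following is a minimal sketch, assuming the sample data from the question; the names df and frames are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "a": ["aaa", "avd", "avd", "avd", "avd"],
    "b": ["bbb", "bad", "bad", "bad", "bad"],
    "c": ["ccc", "cec", "cec", "cec", "cec"],
    "K": [2245, 2245, 2646, 1621, 1621],
    "U": [23, 23, 23, 23, 26],
    "d": ["ddd", "dwq", "dwq", "dwq", "dwq"],
})

# groupby yields (key, sub-dataframe) pairs; with two grouping columns
# each key is a (K, U) tuple. copy() makes each piece independent of df.
frames = {key: group.copy() for key, group in df.groupby(["K", "U"])}

# Look up one category by its (K, U) pair.
print(frames[(2245, 23)])
```

On a dataframe with millions of rows, copying every group duplicates the data in memory, so it may be preferable to keep the groupby object and call get_group only when a particular category is needed.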
You can use duplicated, specify a subset of the key columns and pass keep=False. Then put all of this inside of df[] to filter for those rows:
df[df.duplicated(subset=['K', 'U'], keep=False)]
a b c K U d
0 aaa bbb ccc 2245 23 ddd
1 avd bad cec 2245 23 dwq
For the other dataframe, just add a ~ in front:
df[~df.duplicated(subset=['K', 'U'], keep=False)]
a b c K U d
2 avd bad cec 2646 23 dwq
3 avd bad cec 1621 23 dwq
4 avd bad cec 1621 26 dwq
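The two filters above can be put together as a small runnable sketch, again assuming the sample data from the question; the names repeated_mask, repeated and unique are illustrative. Note that this splits the data into exactly two dataframes (rows whose (K, U) pair occurs more than once, and rows whose pair occurs once), not one dataframe per category. If several different pairs were repeated, they would all end up together in the first dataframe.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "a": ["aaa", "avd", "avd", "avd", "avd"],
    "b": ["bbb", "bad", "bad", "bad", "bad"],
    "c": ["ccc", "cec", "cec", "cec", "cec"],
    "K": [2245, 2245, 2646, 1621, 1621],
    "U": [23, 23, 23, 23, 26],
    "d": ["ddd", "dwq", "dwq", "dwq", "dwq"],
})

# keep=False marks every row whose (K, U) pair appears more than once,
# including the first occurrence.
repeated_mask = df.duplicated(subset=["K", "U"], keep=False)

repeated = df[repeated_mask]   # rows sharing their (K, U) pair with another row
unique = df[~repeated_mask]    # rows whose (K, U) pair occurs only once

print(repeated)
print(unique)
```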