Complex subset of a data.frame
I have a data frame with close to a million rows in it. I need an efficient way to subset the data based on multiple criteria. I can do this in a for loop, but I was wondering if there is a more elegant way to do it.
Time Instance Server Metric Value
17/08/2014 04:00:00 PM ID1 Server888 disk.commandsaveraged.average 0
17/08/2014 04:00:00 PM ID1 Server999 disk.commandsaveraged.average 0
17/08/2014 04:00:00 PM ID1 Server777 disk.commandsaveraged.average 0
17/08/2014 04:05:00 PM ID1 Server888 disk.commandsaveraged.average 0
17/08/2014 04:05:00 PM ID1 Server999 disk.commandsaveraged.average 0
17/08/2014 04:05:00 PM ID1 Server777 disk.commandsaveraged.average 0
17/08/2014 04:00:00 PM ID2 Server888 disk.commandsaveraged.average 0
17/08/2014 04:05:00 PM ID2 Server888 disk.commandsaveraged.average 0
17/08/2014 04:00:00 PM ID3 Server999 disk.commandsaveraged.average 0
17/08/2014 04:05:00 PM ID3 Server999 disk.commandsaveraged.average 0
17/08/2014 04:00:00 PM ID3 Server777 disk.commandsaveraged.average 0
17/08/2014 04:05:00 PM ID3 Server777 disk.commandsaveraged.average 0
17/08/2014 04:00:00 PM ID1 Server888 disk.numberreadaveraged.average 0
17/08/2014 04:00:00 PM ID1 Server999 disk.numberreadaveraged.average 0
17/08/2014 04:00:00 PM ID1 Server777 disk.numberreadaveraged.average 0
17/08/2014 04:05:00 PM ID1 Server888 disk.numberreadaveraged.average 0
17/08/2014 04:05:00 PM ID1 Server999 disk.numberreadaveraged.average 0
17/08/2014 04:05:00 PM ID1 Server777 disk.numberreadaveraged.average 0
17/08/2014 04:00:00 PM ID2 Server888 disk.numberreadaveraged.average 0
17/08/2014 04:05:00 PM ID2 Server888 disk.numberreadaveraged.average 0
17/08/2014 04:00:00 PM ID3 Server999 disk.numberreadaveraged.average 0
17/08/2014 04:05:00 PM ID3 Server999 disk.numberreadaveraged.average 0
17/08/2014 04:00:00 PM ID3 Server777 disk.numberreadaveraged.average 0
17/08/2014 04:05:00 PM ID3 Server777 disk.numberreadaveraged.average 0
17/08/2014 04:00:00 PM ID1 Server888 disk.numberwriteaveraged.average 0
17/08/2014 04:00:00 PM ID7 Server999 disk.numberwriteaveraged.average 0
17/08/2014 04:00:00 PM ID1 Server777 disk.numberwriteaveraged.average 0
17/08/2014 04:05:00 PM ID1 Server888 disk.numberwriteaveraged.average 0
17/08/2014 04:05:00 PM ID1 Server999 disk.numberwriteaveraged.average 0
17/08/2014 04:05:00 PM ID7 Server777 disk.numberwriteaveraged.average 0
17/08/2014 04:00:00 PM ID2 Server888 disk.numberwriteaveraged.average 0
17/08/2014 04:05:00 PM ID5 Server888 disk.numberwriteaveraged.average 0
17/08/2014 04:00:00 PM ID3 Server999 disk.numberwriteaveraged.average 0
17/08/2014 04:05:00 PM ID4 Server999 disk.numberwriteaveraged.average 0
17/08/2014 04:00:00 PM ID3 Server777 disk.numberwriteaveraged.average 0
17/08/2014 04:05:00 PM ID3 Server777 disk.numberwriteaveraged.average 0
What I want to do is create a subset where Metric == disk.numberwriteaveraged.average, Server is either Server999 or Server888, and only the instance IDs that both servers have in common are kept.
NOTE: I use the term "subset" purely because I don't know of any other way to filter data in R; I am still learning. I am looking for speed, and I will be generating data sets much larger than my current one.
(If I understand your question correctly) In your case, data.table is your friend. Try (assuming df is your data set):
library(data.table)
# Within each Instance, keep the rows for the requested metric and the two servers
df2 <- setDT(df)[, .SD[Metric == "disk.numberwriteaveraged.average" &
                       Server %in% c("Server999", "Server888")], by = Instance]
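The filter above keeps every matching row, including instances that only one of the two servers reports. If the "instance IDs in common" requirement is meant strictly, one possible sketch is to filter first and then keep only the groups in which both servers appear. The sample data and the names `servers`, `dt`, `sub` and `common` are made up for illustration; this is an assumption about what the question intends, not necessarily the approach the answer had in mind.

```r
library(data.table)

# Small hypothetical sample with the same columns as the question's data
# (Time omitted for brevity).
df <- data.frame(
  Instance = c("ID1", "ID1", "ID2", "ID3", "ID7"),
  Server   = c("Server888", "Server999", "Server888", "Server999", "Server999"),
  Metric   = "disk.numberwriteaveraged.average",
  Value    = 0,
  stringsAsFactors = FALSE
)

servers <- c("Server999", "Server888")

# as.data.table() makes a copy, so df itself is left unchanged.
dt <- as.data.table(df)

# Step 1: plain row filter on metric and server.
sub <- dt[Metric == "disk.numberwriteaveraged.average" & Server %in% servers]

# Step 2: keep an Instance only if every requested server reports it.
# In this sample only ID1 appears on both servers.
common <- sub[, if (uniqueN(Server) == length(servers)) .SD, by = Instance]
```

Filtering before grouping means the per-group check only runs on rows that already match, which should matter more as the data grows.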