Split a dataset into two dataframes using multiple columns in R

Let's say my dataset looks like this:

working_data <- dplyr::data_frame("Date" = c("2015-01-01", "2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04", "2015-01-04", "2015-01-04"),
                                  "Time" = c("15:01", "15:01", "21:04", "13:19", "07:15", "07:15", "07:15"),
                                  "SeizureTime" = c("0:10", "0:07", "0:11", "0:04", "0:08", "0:06", "0:07"),
                                  "ET" = c("0:35", "0:35", "0:04", "1:10", "3:35", "3:35", "3:35"),
                                  "ONumber" = c("(123)555-1234", "(123)555-1234", "(123)555-9999", "(000)555-9876", "(123)555-1111", "(123)555-1111", "(123)555-1111"),
                                  "TNumber" = c("(123)555-1234", "(123)555-1234", "(123)555-9999", "(000)555-9876", "(123)555-1111", "(123)555-1111", "(123)555-1111"),
                                  "CT" = c("a", "a", "b", "a", "b", "b", "b"))

I'd like to extract the rows in this data that are possible duplicates. The way I'm doing this is as follows:

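library(dplyr)  # provides %>% and filter()

while (nrow(working_data) != 0) {
  # Take the first remaining row as the call to compare against
  target_call <- working_data[1, ]
  working_data <- working_data[-1, ]
  # Rows that match target_call on every key column
  similar_calls <- working_data %>%
    dplyr::filter(Date == target_call$Date,
                  Time == target_call$Time,
                  ET == target_call$ET,
                  ONumber == target_call$ONumber,
                  TNumber == target_call$TNumber)
  # ... (remainder of the loop body is not shown in the question)
}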

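On the first pass through the loop, target_call is set to the first row of working_data and similar_calls is set to the second row. Assuming all is well, the problem I have is that once I've run my function(s) on target_call and similar_calls, I don't want to see those rows again. So I'd like to remove from working_data the rows that were pulled into similar_calls.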

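Once I've populated target_call and similar_calls, I need to determine which call(s), if any, are the same as target_call, and then decide which one is the right one to keep. Once I've chosen the correct call, I add it to a new dataset called resolved_calls. If there are calls left over in similar_calls, I need to repeat the selection analysis on them and add the chosen calls to resolved_calls as well.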

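The best way I can think of to do this is to split the data into two separate dataframes, but I don't know how to do that when the split depends on multiple columns. The only option I've come up with is a really ugly ifelse statement like: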

working_data$Group <- ifelse(working_data$Date == target_call$Date & ... & working_data$TNumber == target_call$TNumber, 1, 0)
similar_calls <- working_data %>% dplyr::filter(Group == 1)
working_data <- working_data %>% dplyr::filter(Group == 0)

Is there a better way?

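You haven't really described what you want to do with each group, but let's assume you just want to grab the first element of each group of similar calls. In that case, something like the duplicated function works well: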

working_data[with(working_data, !duplicated(paste(Date, Time, ET, ONumber, TNumber))),]
# Source: local data frame [4 x 7]
# 
#         Date  Time SeizureTime    ET       ONumber       TNumber    CT
#        (chr) (chr)       (chr) (chr)         (chr)         (chr) (chr)
# 1 2015-01-01 15:01        0:10  0:35 (123)555-1234 (123)555-1234     a
# 2 2015-01-02 21:04        0:11  0:04 (123)555-9999 (123)555-9999     b
# 3 2015-01-03 13:19        0:04  1:10 (000)555-9876 (000)555-9876     a
# 4 2015-01-04 07:15        0:08  3:35 (123)555-1111 (123)555-1111     b
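
As a variation on the above, duplicated also accepts a data frame, so the key columns can be compared directly instead of being pasted into one string. Since paste joins values with a space, two different rows could in principle produce the same string if the values themselves contain spaces. A minimal sketch, assuming a small stand-in data frame called calls and a key_cols vector (both names are hypothetical and not from the original post):

```r
# Stand-in for working_data (hypothetical, trimmed to three rows)
calls <- data.frame(
  Date        = c("2015-01-01", "2015-01-01", "2015-01-02"),
  Time        = c("15:01", "15:01", "21:04"),
  SeizureTime = c("0:10", "0:07", "0:11"),
  ET          = c("0:35", "0:35", "0:04"),
  ONumber     = c("(123)555-1234", "(123)555-1234", "(123)555-9999"),
  TNumber     = c("(123)555-1234", "(123)555-1234", "(123)555-9999"),
  stringsAsFactors = FALSE
)

# Columns that define a possible duplicate (the same ones used in the question)
key_cols <- c("Date", "Time", "ET", "ONumber", "TNumber")

# duplicated() on a data frame flags every row whose key values already
# appeared in an earlier row; negating it keeps the first row of each group
first_in_group <- calls[!duplicated(calls[key_cols]), ]
```

Here the first two rows share all key values, so only the first of them and the third row are kept.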

In dplyr syntax, you can use group_by to group on the appropriate columns and then filter with row_number to get the first instance in each group:

working_data %>%
  group_by(Date, Time, ET, ONumber, TNumber) %>%
  filter(row_number() == 1)
# Source: local data frame [4 x 7]
# Groups: Date, Time, ET, ONumber, TNumber [4]
# 
#         Date  Time SeizureTime    ET       ONumber       TNumber    CT
#        (chr) (chr)       (chr) (chr)         (chr)         (chr) (chr)
# 1 2015-01-01 15:01        0:10  0:35 (123)555-1234 (123)555-1234     a
# 2 2015-01-02 21:04        0:11  0:04 (123)555-9999 (123)555-9999     b
# 3 2015-01-03 13:19        0:04  1:10 (000)555-9876 (000)555-9876     a
# 4 2015-01-04 07:15        0:08  3:35 (123)555-1111 (123)555-1111     b
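
If the grouping is only needed to pick the first row, dplyr's distinct with .keep_all = TRUE should return the same rows without leaving the result grouped. A sketch on assumed stand-in data (the name calls is hypothetical):

```r
library(dplyr)

# Stand-in for working_data (hypothetical, trimmed to three rows)
calls <- data.frame(
  Date        = c("2015-01-01", "2015-01-01", "2015-01-02"),
  Time        = c("15:01", "15:01", "21:04"),
  SeizureTime = c("0:10", "0:07", "0:11"),
  ET          = c("0:35", "0:35", "0:04"),
  ONumber     = c("(123)555-1234", "(123)555-1234", "(123)555-9999"),
  TNumber     = c("(123)555-1234", "(123)555-1234", "(123)555-9999"),
  stringsAsFactors = FALSE
)

# Keep the first row for each distinct combination of the key columns;
# .keep_all = TRUE retains the non-key columns (here SeizureTime) as well
first_in_group <- calls %>%
  distinct(Date, Time, ET, ONumber, TNumber, .keep_all = TRUE)
```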

If you want to process the groups more generally, you can use group_by and then summarize to aggregate each group in whatever way you need:

# Take text data in format mm:ss and return the number of seconds
secs <- function(x) {
  spl <- strsplit(x, ":")
  60*as.numeric(sapply(spl, "[", 1)) + as.numeric(sapply(spl, "[", 2))
}
working_data %>%
  group_by(Date, Time, ET, ONumber, TNumber) %>% 
  summarize(meanSeizure=mean(secs(SeizureTime)))
# Source: local data frame [4 x 6]
# Groups: Date, Time, ET, ONumber [?]
# 
#         Date  Time    ET       ONumber       TNumber meanSeizure
#        (chr) (chr) (chr)         (chr)         (chr)       (dbl)
# 1 2015-01-01 15:01  0:35 (123)555-1234 (123)555-1234         8.5
# 2 2015-01-02 21:04  0:04 (123)555-9999 (123)555-9999        11.0
# 3 2015-01-03 13:19  1:10 (000)555-9876 (000)555-9876         4.0
# 4 2015-01-04 07:15  3:35 (123)555-1111 (123)555-1111         7.0
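
For the literal split asked about in the question (rows that match target_call on all key columns versus everything else), one option is a pair of filtering joins: semi_join keeps the rows that have a match and anti_join keeps the rows that do not. This is a sketch under assumptions: calls, remaining, and key_cols are hypothetical names, and the step that chooses the right call from similar_calls is left out because the question does not specify it.

```r
library(dplyr)

# Stand-in for working_data (hypothetical, trimmed to three rows)
calls <- data.frame(
  Date        = c("2015-01-01", "2015-01-01", "2015-01-02"),
  Time        = c("15:01", "15:01", "21:04"),
  SeizureTime = c("0:10", "0:07", "0:11"),
  ET          = c("0:35", "0:35", "0:04"),
  ONumber     = c("(123)555-1234", "(123)555-1234", "(123)555-9999"),
  TNumber     = c("(123)555-1234", "(123)555-1234", "(123)555-9999"),
  stringsAsFactors = FALSE
)

# Columns that must all match for two calls to count as similar
key_cols <- c("Date", "Time", "ET", "ONumber", "TNumber")

# Same first step as the loop in the question
target_call <- calls[1, ]
remaining   <- calls[-1, ]

# Rows of `remaining` that match target_call on every key column
similar_calls <- semi_join(remaining, target_call, by = key_cols)

# Rows of `remaining` that do not match; these stay in the work queue
remaining <- anti_join(remaining, target_call, by = key_cols)
```

Inside the while loop from the question, assigning the anti_join result back to working_data would drop the matched rows before the next iteration.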
