In R, how to perform an operation on a specific subset of a data.frame?

(I have a feeling I'll feel silly once I get the answer, but I just can't figure it out.)

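I have a data.frame with an empty column at the end. It will be mostly filled with NAs, but I want to populate some values in it. This column represents a guess at data that is missing from one of the other columns in the data.frame.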

My initial data.frame looks like this:

Game | Rating | MinPlayers | MaxPlayers | MaxPlayersGuess
---------------------------------------------------------
A    | 6      | 3          | 6          |
B    | 7      | 3          | 7          |
C    | 6.5    | 3          | N/A        |median(df$MaxPlayers[df$MinPlayers ==3,])
D    | 7      | 3          | 6          |
E    | 7      | 3          | 5          |
F    | 9.5    | 2          | 5          |
G    | 6      | 2          | 4          |
H    | 7      | 2          | 4          |
I    | 6.5    | 2          | N/A        |median(df$MaxPlayers[df$MinPlayers ==2,])
J    | 7      | 2          | 2          |
K    | 7      | 2          | 4          |

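Notice that two of the rows have "N/A" for MaxPlayers. I am trying to use the information I do have to guess what MaxPlayers might be. If the median(MaxPlayers) for 3-player games is 6, then MaxPlayersGuess should equal 6 for games with MinPlayers == 3 and MaxPlayers == N/A. (I've tried to indicate in code what value MaxPlayersGuess should get in the example above.)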

The resulting data.frame would look like this:

Game | Rating | MinPlayers | MaxPlayers | MaxPlayersGuess
---------------------------------------------------------
A    | 6      | 3          | 6          |
B    | 7      | 3          | 7          |
C    | 6.5    | 3          | N/A        |6
D    | 7      | 3          | 6          |
E    | 7      | 3          | 5          |
F    | 9.5    | 2          | 5          |
G    | 6      | 2          | 4          |
H    | 7      | 2          | 4          |
I    | 6.5    | 2          | N/A        |4
J    | 7      | 2          | 2          |
K    | 7      | 2          | 4          |

Here is the result of one attempt:

gld$MaxPlayersGuess <- ifelse(is.na(gld$MaxPlayers), median(gld$MaxPlayers[gld$MinPlayers,]), NA)


Error in gld$MaxPlayers[gld$MinPlayers, ] : 
incorrect number of dimensions

Updated to match the posted example.

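Here is my tip for today: sometimes it's easier to compute what you need first and then grab it when you need it, rather than packing all of the logic into one expression. You are trying to come up with a way to calculate everything at once, and that is what makes it confusing. Break it into steps. You need to know the median value of "MaxPlayer" for each possible group of "MinPlayer". Then, when MaxPlayer is missing, you want to use that value. So here is a simple way to do that.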

#generate fake data 
MinPlayer <- rep(3:2, each = 4)
MaxPlayer <- rep(2:5, each = 2, times = 2)

df <- data.frame(MinPlayer, MaxPlayer)

#replace some values of MaxPlayer with NA
df$MaxPlayer <- ifelse(df$MaxPlayer == 3, NA, df$MaxPlayer)

####STARTING DATA
# > df
# MinPlayer MaxPlayer
# 1          3         2
# 2          3         2
# 3          3        NA
# 4          3        NA
# 5          2         4
# 6          2         4
# 7          2         5
# 8          2         5
# 9          3         2
# 10         3         2
# 11         3        NA
# 12         3        NA
# 13         2         4
# 14         2         4
# 15         2         5
# 16         2         5

####STEP 1
#find the median of MaxPlayer for each group of MinPlayer (e.g., when MinPlayer == 1, 2 or whatever)
#just add a column to the data frame that has the right median value for each subset of MinPlayer in it and grab that value to use later. 
library(plyr) #plyr is a great way to compute things across data subsets
df <- ddply(df, c("MinPlayer"), transform, 
            median.minp = median(MaxPlayer, na.rm = TRUE)) #ignore NAs in the median

####STEP 2
#anytime that MaxPlayer == NA, grab the median value to replace the NA, otherwise keep the MaxPlayer value
df$MaxPlayer <- ifelse(is.na(df$MaxPlayer), df$median.minp, df$MaxPlayer)

####STEP 3
#you had to compute an extra column you don't really want, so drop it now that you're done with it
df <- df[ , !(names(df) %in% "median.minp")]

####RESULT
# > df
# MinPlayer MaxPlayer
# 1          2         4
# 2          2         4
# 3          2         5
# 4          2         5
# 5          2         4
# 6          2         4
# 7          2         5
# 8          2         5
# 9          3         2
# 10         3         2
# 11         3         2
# 12         3         2
# 13         3         2
# 14         3         2
# 15         3         2
# 16         3         2
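If you would rather not load plyr, the same three steps can be sketched in base R with `ave()`. This is only an illustrative alternative, not part of the answer above; the names `group_median` and `filled` are made up for the example. Unlike `ddply`, `ave()` returns its result in the original row order.

```r
# Same fake data as above (the recycling of MinPlayer is written out explicitly)
df <- data.frame(
  MinPlayer = rep(3:2, each = 4, times = 2),
  MaxPlayer = rep(2:5, each = 2, times = 2)
)
df$MaxPlayer[df$MaxPlayer == 3] <- NA

# ave() returns one value per row: the median of that row's MinPlayer group
group_median <- ave(df$MaxPlayer, df$MinPlayer,
                    FUN = function(x) median(x, na.rm = TRUE))

# Use the group median only where MaxPlayer is missing
filled <- ifelse(is.na(df$MaxPlayer), group_median, df$MaxPlayer)
```

Because the group medians live in a separate vector, there is no helper column to drop afterwards.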

Old answer below.

Please post a reproducible example!!

#fake data 
this <- rep(1:2, each = 1, times = 2)
that <- rep(3:2, each = 1, times = 2)

df <- data.frame(this, that)

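If you are just asking about basic indexing, e.g. finding values where some condition is met, this returns the row indices of the values that match the condition (look up ?which):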

> which(df$this < df$that)
[1] 1 3

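This returns the VALUES that match your condition rather than the row indices. You simply use the row indices returned by "which" to look up the corresponding values in the right column of the data frame (here, "this"):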

> df[which(df$this < df$that), "this"]
[1] 1 1
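As a small aside, here is a sketch of why the comma matters when indexing. A column pulled out with `$` is a plain vector and takes a single index, while the data frame itself takes `[rows, columns]`. Adding a comma to the vector form raises the same kind of error reported in the question. The variable names `keep`, `by_vector`, `by_frame` and `msg` are made up for this illustration.

```r
this <- rep(1:2, times = 2)
that <- rep(3:2, times = 2)
df <- data.frame(this, that)

# A logical vector can be used as an index directly, without which()
keep <- df$this < df$that

by_vector <- df$this[keep]       # vector: one index, no comma
by_frame  <- df[keep, "this"]    # data frame: [rows, columns]

# Putting a comma inside the vector index is an error; capture its message
msg <- tryCatch(df$this[keep, ], error = function(e) conditionMessage(e))
```

One difference worth knowing: `which()` silently drops positions where the condition is NA, whereas a logical index containing NA returns NA for those positions.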

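If you want to apply some calculation when "this" is less than "that" and then add a new column to your data frame, use "ifelse". ifelse evaluates a logical vector marking where things match your condition, and then does something to the things that match (i.e. where your logical test == TRUE).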

#if "this" is < "that", multiply by 2 
df$result <- ifelse(df$this < df$that, df$this * 2, NA)

> df
this that result
1    1    3      2
2    2    2     NA
3    1    3      2
4    2    2     NA

Without a reproducible example, I can't offer much more.

I think you already have everything you need in @griffmer's answer. But a less elegant, though perhaps more intuitive, approach could be a loop:

## Your data:
df <- data.frame(
        Game = LETTERS[1:11],
        Rating = c(6,7,6.5,7,7,9.5,6,7,6.5,7,7),
        MinPlayers = c(rep(3,5), rep(2,6)),
        MaxPlayers = c(6,7,NA,6,5,5,4,4,NA,2,4)     
)

## Loop over rows:
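df$MaxPlayersGuess <- vapply(seq_len(nrow(df)), function(ii) {
  if (is.na(df$MaxPlayers[ii])) {
    # missing: use the median MaxPlayers of games with the same MinPlayers
    median(df$MaxPlayers[df$MinPlayers == df$MinPlayers[ii]], na.rm = TRUE)
  } else {
    # present: keep the observed value
    df$MaxPlayers[ii]
  }
}, numeric(1))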

This gives you

df
#    Game Rating MinPlayers MaxPlayers MaxPlayersGuess
# 1     A    6.0          3          6               6
# 2     B    7.0          3          7               7
# 3     C    6.5          3         NA               6
# 4     D    7.0          3          6               6
# 5     E    7.0          3          5               5
# 6     F    9.5          2          5               5
# 7     G    6.0          2          4               4
# 8     H    7.0          2          4               4
# 9     I    6.5          2         NA               4
# 10    J    7.0          2          2               2
# 11    K    7.0          2          4               4
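The question's expected output leaves MaxPlayersGuess empty wherever MaxPlayers is known. A loop-free sketch of that variant, assuming NA is an acceptable stand-in for "empty", is to build a small lookup table of group medians with `tapply()` and index it by group. The names `medians` and `row_median` are made up for this example, and the Rating column is omitted for brevity.

```r
df <- data.frame(
  Game = LETTERS[1:11],
  MinPlayers = c(rep(3, 5), rep(2, 6)),
  MaxPlayers = c(6, 7, NA, 6, 5, 5, 4, 4, NA, 2, 4)
)

# One median per MinPlayers group, named by the group value ("2", "3")
medians <- tapply(df$MaxPlayers, df$MinPlayers, median, na.rm = TRUE)

# Look up each row's group median by name, dropping the names afterwards
row_median <- as.vector(medians[as.character(df$MinPlayers)])

# Fill the guess only where MaxPlayers is missing; leave NA everywhere else
df$MaxPlayersGuess <- ifelse(is.na(df$MaxPlayers), row_median, NA)
```

With this data, only rows C and I receive a guess (6 and 4 respectively), matching the table in the question.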

If you want to use dplyr, you can try:

Input:

df <- data.frame(
  Game = LETTERS[1:11],
  Rating = c(6,7,6.5,7,7,9.5,6,7,6.5,7,7),
  MinPlayers = c(rep(3,5), rep(2,6)),
  MaxPlayers = c(6,7,NA,6,5,5,4,4,NA,2,4)     
)

Processing:

library(dplyr) # provides %>%, group_by(), mutate() and if_else()

df %>% 
  group_by(MinPlayers) %>%
  mutate(MaxPlayers = if_else(is.na(MaxPlayers), median(MaxPlayers, na.rm=TRUE), MaxPlayers))

This groups the data by MinPlayers and then assigns the group's median MaxPlayers to the rows where MaxPlayers is missing.

Output:

Source: local data frame [11 x 4]
Groups: MinPlayers [2]

     Game Rating MinPlayers MaxPlayers
   <fctr>  <dbl>      <dbl>      <dbl>
1       A    6.0          3          6
2       B    7.0          3          7
3       C    6.5          3          6
4       D    7.0          3          6
5       E    7.0          3          5
6       F    9.5          2          5
7       G    6.0          2          4
8       H    7.0          2          4
9       I    6.5          2          4
10      J    7.0          2          2
11      K    7.0          2          4
