Assigning values to a column within the data set itself using R

Question

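I am new to R. I have a column of about 26,000 entries that contains about 1,200 unique values. Let's assume the column is named "Breed".

What I need:

I need to get the frequency of each unique value in the column.
I extracted the BreedType and its frequency as shown below. (The breed column is named BreedType.)
Then, using an if condition, I need to create a new column that holds 'F' if the frequency of a BreedType is less than 50, and the value of 'BreedType' if it is greater than 50.

Here is what I tried.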

x<- sort(table(full$Breed),decreasing=T)
w=as.data.frame(x)

names(w)[1] = 'BreedType'

w$TrueFalse<-ifelse(w$Freq<50,F,w$BreedType)
w$TrueFalse

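But the output is not what I expected. Although F is assigned correctly to each row, w$BreedType does not yield the BreedType value. Instead it gives an integer with 1 added to it, rather than the specific BreedType.

Can someone explain why the output is not as expected?

The Breed column looks like the following in the data set, with 20,000 rows and 1,200 unique values.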

 Breed

 Shetland Sheepdog Mix
 Domestic Shorthair Mix
 Pit Bull Mix
 Domestic Shorthair Mix
 Lhasa Apso/Miniature Poodle
 Cairn Terrier/Chihuahua Shorthair
 Domestic Shorthair Mix
 Domestic Shorthair Mix
 American Pit Bull Terrier Mix
 Cairn Terrier
 Domestic Shorthair Mix
 Miniature Schnauzer Mix
 Pit Bull Mix
 Yorkshire Terrier Mix
 Great Pyrenees Mix
 Domestic Shorthair Mix
 Domestic Shorthair Mix
 Pit Bull Mix
 Angora Mix
 Flat Coat Retriever Mix
 Queensland Heeler Mix
 Domestic Shorthair Mix
 Plott Hound/Boxer

My expected result is

BreedType                   Frequency   TrueFalse

Shetland Sheepdog Mix       60          Shetland Sheepdog Mix  
Domestic Shorthair Mix      20          F
Pit Bull Mix                80          Pit Bull Mix
Domestic Shorthair Mix      10          F

Answer 1

Original data, the full data frame:

> full
#                      Breed
# 1:             Shetland Sheepdog Mix
# 2:            Domestic Shorthair Mix
# 3:                      Pit Bull Mix
# 4:            Domestic Shorthair Mix
# 5:       Lhasa Apso/Miniature Poodle
# 6: Cairn Terrier/Chihuahua Shorthair
# 7:            Domestic Shorthair Mix
# 8:            Domestic Shorthair Mix
# 9:     American Pit Bull Terrier Mix
# 10:                     Cairn Terrier
# 11:            Domestic Shorthair Mix
# 12:           Miniature Schnauzer Mix
# 13:                      Pit Bull Mix
# 14:             Yorkshire Terrier Mix
# 15:                Great Pyrenees Mix
# 16:            Domestic Shorthair Mix
# 17:            Domestic Shorthair Mix
# 18:                      Pit Bull Mix
# 19:                        Angora Mix
# 20:           Flat Coat Retriever Mix
# 21:             Queensland Heeler Mix
# 22:            Domestic Shorthair Mix
# 23:                 Plott Hound/Boxer
# Breed

Load the data.table library into the workspace

library("data.table")

Convert the full data frame to a data table by reference

setDT(full)

Copy full to dt1. This is done to keep a backup of the full data table.

dt1 <- copy(full)

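Group dt1 by BreedType (the Breed column), then access the internal variable .N, which stores the number of entries in each subset, and apply an ifelse condition to it. Then save the results as the Frequency and TrueFalse columns.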

dt1[, c("Frequency", "TrueFalse") := .(.N, ifelse(.N < 50, FALSE, Breed)), by = Breed]

Display dt1 after the step above

> dt1
#                          Breed          Frequency TrueFalse
# 1:             Shetland Sheepdog Mix         1     FALSE
# 2:            Domestic Shorthair Mix         8     FALSE
# 3:                      Pit Bull Mix         3     FALSE
# 4:            Domestic Shorthair Mix         8     FALSE
# 5:       Lhasa Apso/Miniature Poodle         1     FALSE
# 6: Cairn Terrier/Chihuahua Shorthair         1     FALSE
# 7:            Domestic Shorthair Mix         8     FALSE
# 8:            Domestic Shorthair Mix         8     FALSE
# 9:     American Pit Bull Terrier Mix         1     FALSE
# 10:                     Cairn Terrier         1     FALSE
# 11:            Domestic Shorthair Mix         8     FALSE
# 12:           Miniature Schnauzer Mix         1     FALSE
# 13:                      Pit Bull Mix         3     FALSE
# 14:             Yorkshire Terrier Mix         1     FALSE
# 15:                Great Pyrenees Mix         1     FALSE
# 16:            Domestic Shorthair Mix         8     FALSE
# 17:            Domestic Shorthair Mix         8     FALSE
# 18:                      Pit Bull Mix         3     FALSE
# 19:                        Angora Mix         1     FALSE
# 20:           Flat Coat Retriever Mix         1     FALSE
# 21:             Queensland Heeler Mix         1     FALSE
# 22:            Domestic Shorthair Mix         8     FALSE
# 23:                 Plott Hound/Boxer         1     FALSE
# Breed Frequency TrueFalse

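None of the breed types in the data you provided has a frequency of 50 or more. If one did, the breed type would be added instead of FALSE, according to the ifelse statement.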
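To see both branches of the ifelse in action, here is a minimal sketch on a made-up five-row table with a hypothetical threshold of 3 instead of 50. It computes the count first and labels the rows in a second step, and it uses the string "F" rather than FALSE so that the new column is character throughout. Mixing a logical FALSE for rare breeds with character breed names for common ones may not combine cleanly across groups. The toy data and the name threshold are assumptions for illustration, not part of the original answer.

```r
library(data.table)

# Toy data: hypothetical values, not the asker's data set
full <- data.table(Breed = c("Pit Bull Mix", "Angora Mix", "Pit Bull Mix",
                             "Pit Bull Mix", "Cairn Terrier"))
threshold <- 3  # assumed cut-off for the toy data; the question uses 50

dt1 <- copy(full)

# Step 1: per-row frequency of the row's breed, added by reference
dt1[, Frequency := .N, by = Breed]

# Step 2: keep the breed name for frequent breeds, "F" otherwise;
# both branches are character, so the column type is stable
dt1[, TrueFalse := ifelse(Frequency < threshold, "F", Breed)]

print(dt1)
```

Every row keeps its original position, so the new columns sit in the data set itself rather than in a separate summary table.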

Answer 2

This assumes you have already obtained the frequency of each BreedType. It is similar to @Sathish's answer, but uses a data.frame instead of a data.table.

testData <- data.frame(BreedType = c("Shetland Sheepdog Mix", "Domestic Shorthair Mix", "Pit Bull Mix", "Domestic Shorthair Mix"),
                   Frequency = c(60, 20, 80, 10), stringsAsFactors = F)
testData$TrueFalse <- testData$BreedType
testData$TrueFalse[testData$Frequency < 50] <- F

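The output is the same as what you have. However, "FALSE" will be converted to a string (rather than a Boolean), because the column was initialized as a character vector. I am not sure whether you can mix Booleans and strings.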
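On the last point: an atomic vector in R holds a single type, so a logical assigned into a character column is coerced to a string. A small sketch with made-up values (x and y are hypothetical names used only for illustration):

```r
# A character vector, like testData$TrueFalse before the assignment
x <- c("Shetland Sheepdog Mix", "Domestic Shorthair Mix")

# Assigning a logical into it coerces the value to the string "FALSE"
x[2] <- FALSE
print(class(x))
print(x)

# Assigning the string "F" directly gives the label the question asked for
y <- c("Shetland Sheepdog Mix", "Domestic Shorthair Mix")
y[2] <- "F"
print(y)
```

If the literal label 'F' is wanted, as in the expected output of the question, assigning "F" instead of F avoids the coercion to "FALSE".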

Answer 3

You can use the count function from the plyr package. I have demonstrated an example using the data you provided.

> library(plyr)

> df <- read.table(text = "Shetland Sheepdog Mix
  Domestic Shorthair Mix
  Pit Bull Mix
  Domestic Shorthair Mix
  Lhasa Apso/Miniature Poodle
  Cairn Terrier/Chihuahua Shorthair
  Domestic Shorthair Mix
  Domestic Shorthair Mix
  American Pit Bull Terrier Mix
  Cairn Terrier
  Domestic Shorthair Mix
  Miniature Schnauzer Mix
  Pit Bull Mix
  Yorkshire Terrier Mix
  Great Pyrenees Mix
  Domestic Shorthair Mix
  Domestic Shorthair Mix
  Pit Bull Mix
  Angora Mix
  Flat Coat Retriever Mix
  Queensland Heeler Mix
  Domestic Shorthair Mix
  Plott Hound/Boxer", sep='\n', stringsAsFactors = F, col.names = c('Breed'))

Use the plyr::count function.

> df <- count(df, 'Breed')

> df 

##                                 Breed freq
## 1       American Pit Bull Terrier Mix    1
## 2                          Angora Mix    1
## 3                       Cairn Terrier    1
## 4   Cairn Terrier/Chihuahua Shorthair    1
## 5              Domestic Shorthair Mix    8
## 6             Flat Coat Retriever Mix    1
## ...
## ...


> df$TrueFalse <- ifelse(df$freq >= 3, df$Breed, F)

> df

##                                      Breed freq                    TrueFalse
## 1            American Pit Bull Terrier Mix    1                        FALSE
## 2                               Angora Mix    1                        FALSE
## 3                            Cairn Terrier    1                        FALSE
## 4        Cairn Terrier/Chihuahua Shorthair    1                        FALSE
## 5                   Domestic Shorthair Mix    8       Domestic Shorthair Mix
## 6                  Flat Coat Retriever Mix    1                        FALSE

Answer 4

Well, you can also use base R table to get the frequencies

new_df <- data.frame(table(df$Breed))
#                            Var1              Freq
#1        American Pit Bull Terrier Mix    1
#2                           Angora Mix    1
#3                        Cairn Terrier    1
#4    Cairn Terrier/Chihuahua Shorthair    1
#5               Domestic Shorthair Mix    8
#6              Flat Coat Retriever Mix    1
#7                   Great Pyrenees Mix    1
#8          Lhasa Apso/Miniature Poodle    1
#9              Miniature Schnauzer Mix    1
#10                        Pit Bull Mix    3
#11                   Plott Hound/Boxer    1
#12               Queensland Heeler Mix    1
#13               Shetland Sheepdog Mix    1
#14               Yorkshire Terrier Mix    1

Then use ifelse to get the values of the TrueFalse column

new_df$TrueFalse <- ifelse(new_df$Freq > 2, as.character(new_df$Var1), "F")

#                                 Var1  Freq                TrueFalse
#1        American Pit Bull Terrier Mix    1                        F
#2                           Angora Mix    1                        F
#3                        Cairn Terrier    1                        F
#4    Cairn Terrier/Chihuahua Shorthair    1                        F
#5               Domestic Shorthair Mix    8   Domestic Shorthair Mix
#6              Flat Coat Retriever Mix    1                        F
#7                   Great Pyrenees Mix    1                        F
#8          Lhasa Apso/Miniature Poodle    1                        F
#9              Miniature Schnauzer Mix    1                        F
#10                        Pit Bull Mix    3             Pit Bull Mix
#11                   Plott Hound/Boxer    1                        F
#12               Queensland Heeler Mix    1                        F
#13               Shetland Sheepdog Mix    1                        F
#14               Yorkshire Terrier Mix    1                        F
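The as.character() call above matters because data.frame(table(...)) stores the labels as a factor, and ifelse() drops the factor attributes, so it returns the underlying integer codes instead of the labels. That appears to be what happened in the question's attempt. A minimal sketch on made-up data (the five-row breed vector and the threshold of 2 are assumptions for illustration):

```r
# Toy data: hypothetical values, not the asker's data set
breed <- c("Pit Bull Mix", "Angora Mix", "Pit Bull Mix",
           "Pit Bull Mix", "Cairn Terrier")

# table() + as.data.frame() yields a factor column of labels plus Freq
w <- as.data.frame(table(Breed = breed))
print(is.factor(w$Breed))

# Passing the factor straight to ifelse() leaks its integer codes
codes <- ifelse(w$Freq < 2, "F", w$Breed)

# Converting to character first keeps the breed names
labels <- ifelse(w$Freq < 2, "F", as.character(w$Breed))

print(codes)
print(labels)
```

In codes, the frequent breed shows up as the position of its level rather than its name, while labels carries the name itself.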

Answer 5

If we need summarized output, then

library(data.table)
setDT(df)[, .(Frequency = .N, TrueFalse = .N > 55), by = Breed]
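Unlike an assignment with :=, this form returns one row per breed, and TrueFalse is a logical flag rather than the breed name. A minimal sketch of the same pattern on made-up data, with a hypothetical threshold of 3 so that the toy table shows both outcomes (the data, the names threshold and res, and the use of >= are assumptions for illustration):

```r
library(data.table)

# Toy data: hypothetical values, not the asker's data set
df <- data.frame(Breed = c("Pit Bull Mix", "Angora Mix", "Pit Bull Mix",
                           "Pit Bull Mix", "Cairn Terrier"),
                 stringsAsFactors = FALSE)
threshold <- 3  # assumed cut-off for the toy data

# One row per breed: its count and whether the count reaches the threshold
res <- setDT(df)[, .(Frequency = .N, TrueFalse = .N >= threshold), by = Breed]

print(res)
```

The groups appear in order of first occurrence in the data, so the result is not sorted by frequency unless it is ordered afterwards.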

5 answers

Answer 1
2 accepted 2016-07-28 02:50:33

Answer 2
2 2016-07-28 02:59:55

Answer 3
2 2016-07-28 03:00:41

Answer 4
0 2016-07-28 04:16:39

Answer 5
0 2016-07-28 04:28:08
