
Assign column with value in the dataset itself using R

I'm quite new to R. I have a column with around 26000 values, of which around 1200 are unique. Let's assume the name of the column is 'Breed'.

What I need is:

  1. I need to get the frequency of each unique value in the column.

    I have extracted the BreedType and the frequency as shown below. (The Breed column is given the name BreedType.)

  2. Then, if the frequency of a BreedType is less than 50, I need a new column that holds 'F' for it; if the frequency is greater than 50, the new column should hold the value of 'BreedType'.

Here is what I have tried.

x<- sort(table(full$Breed),decreasing=T)
w=as.data.frame(x)

names(w)[1] = 'BreedType'

w$TrueFalse<-ifelse(w$Freq<50,F,w$BreedType)
w$TrueFalse

But the output is not what I expected. Although F is assigned to the right rows, the remaining rows do not get the value of BreedType; instead they get integers that increase one by one.

Can someone please explain why the output is not as expected?

The Breed column looks like the following in the dataset, which has 20,000 rows and 1200 unique values.

 Breed

 Shetland Sheepdog Mix
 Domestic Shorthair Mix
 Pit Bull Mix
 Domestic Shorthair Mix
 Lhasa Apso/Miniature Poodle
 Cairn Terrier/Chihuahua Shorthair
 Domestic Shorthair Mix
 Domestic Shorthair Mix
 American Pit Bull Terrier Mix
 Cairn Terrier
 Domestic Shorthair Mix
 Miniature Schnauzer Mix
 Pit Bull Mix
 Yorkshire Terrier Mix
 Great Pyrenees Mix
 Domestic Shorthair Mix
 Domestic Shorthair Mix
 Pit Bull Mix
 Angora Mix
 Flat Coat Retriever Mix
 Queensland Heeler Mix
 Domestic Shorthair Mix
 Plott Hound/Boxer

My expected outcome is:

BreedType                   Frequency   TrueFalse

Shetland Sheepdog Mix       60          Shetland Sheepdog Mix  
Domestic Shorthair Mix      20          F
Pit Bull Mix                80          Pit Bull Mix
Domestic Shorthair Mix      10          F

Raw data, the full data frame:

> full
#                      Breed
# 1:             Shetland Sheepdog Mix
# 2:            Domestic Shorthair Mix
# 3:                      Pit Bull Mix
# 4:            Domestic Shorthair Mix
# 5:       Lhasa Apso/Miniature Poodle
# 6: Cairn Terrier/Chihuahua Shorthair
# 7:            Domestic Shorthair Mix
# 8:            Domestic Shorthair Mix
# 9:     American Pit Bull Terrier Mix
# 10:                     Cairn Terrier
# 11:            Domestic Shorthair Mix
# 12:           Miniature Schnauzer Mix
# 13:                      Pit Bull Mix
# 14:             Yorkshire Terrier Mix
# 15:                Great Pyrenees Mix
# 16:            Domestic Shorthair Mix
# 17:            Domestic Shorthair Mix
# 18:                      Pit Bull Mix
# 19:                        Angora Mix
# 20:           Flat Coat Retriever Mix
# 21:             Queensland Heeler Mix
# 22:            Domestic Shorthair Mix
# 23:                 Plott Hound/Boxer
# Breed

Load the data.table library in your workspace.

library("data.table")

Convert the full data frame to a data.table by reference.

setDT(full)

Make a copy of the full data.table as dt1, so that full is kept as a backup.

dt1 <- copy(full)

Group dt1 by the Breed column, then use the .N internal variable, which stores the number of rows in each group, in the ifelse condition. Save the results as the Frequency and TrueFalse columns.

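# Use the string "FALSE" and as.character() so every group returns a character value
dt1[, c("Frequency", "TrueFalse") := .(.N, ifelse(.N < 50, "FALSE", as.character(Breed))), by = Breed]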

Display dt1 after the above step.

> dt1
#                          Breed          Frequency TrueFalse
# 1:             Shetland Sheepdog Mix         1     FALSE
# 2:            Domestic Shorthair Mix         8     FALSE
# 3:                      Pit Bull Mix         3     FALSE
# 4:            Domestic Shorthair Mix         8     FALSE
# 5:       Lhasa Apso/Miniature Poodle         1     FALSE
# 6: Cairn Terrier/Chihuahua Shorthair         1     FALSE
# 7:            Domestic Shorthair Mix         8     FALSE
# 8:            Domestic Shorthair Mix         8     FALSE
# 9:     American Pit Bull Terrier Mix         1     FALSE
# 10:                     Cairn Terrier         1     FALSE
# 11:            Domestic Shorthair Mix         8     FALSE
# 12:           Miniature Schnauzer Mix         1     FALSE
# 13:                      Pit Bull Mix         3     FALSE
# 14:             Yorkshire Terrier Mix         1     FALSE
# 15:                Great Pyrenees Mix         1     FALSE
# 16:            Domestic Shorthair Mix         8     FALSE
# 17:            Domestic Shorthair Mix         8     FALSE
# 18:                      Pit Bull Mix         3     FALSE
# 19:                        Angora Mix         1     FALSE
# 20:           Flat Coat Retriever Mix         1     FALSE
# 21:             Queensland Heeler Mix         1     FALSE
# 22:            Domestic Shorthair Mix         8     FALSE
# 23:                 Plott Hound/Boxer         1     FALSE
# Breed Frequency TrueFalse

The data you provided do not contain any breed with a frequency greater than 50. If there were one, the breed name would be stored instead of FALSE, as per the ifelse statement.
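To see the other branch of the condition, here is a minimal sketch on made-up data with a lower cut-off. The toy table, the threshold of 3 and the label "F" are assumptions for illustration, not part of the original data. The count is added first and the label is derived in a second, ungrouped step, so the new column is character for every row.

```r
library(data.table)

# Hypothetical toy data: 5 + 3 + 1 rows across three breeds
toy <- data.table(Breed = c(rep("Domestic Shorthair Mix", 5),
                            rep("Pit Bull Mix", 3),
                            "Cairn Terrier"))

threshold <- 3  # assumed cut-off for this toy data; the question uses 50

# Step 1: per-breed row count, added by reference
toy[, Frequency := .N, by = Breed]

# Step 2: keep the breed name for frequent breeds, "F" for rare ones
toy[, TrueFalse := ifelse(Frequency < threshold, "F", as.character(Breed))]

# One row per breed
summary_dt <- unique(toy)
print(summary_dt)
```

Replacing the threshold with 50 and toy with the real table should give the per-breed result asked for in the question.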

Assuming that your implementation of the frequency per BreedType already works, this is similar to @Sathish's answer, but uses a data.frame instead of a data.table.

# Small example with the per-breed frequencies already computed
testData <- data.frame(BreedType = c("Shetland Sheepdog Mix", "Domestic Shorthair Mix", "Pit Bull Mix", "Domestic Shorthair Mix"),
                   Frequency = c(60, 20, 80, 10), stringsAsFactors = FALSE)
# Start from the breed names, then overwrite the rare ones
testData$TrueFalse <- testData$BreedType
testData$TrueFalse[testData$Frequency < 50] <- FALSE

The output is the same as what you have in your expected outcome. However, FALSE is converted to the string "FALSE" (instead of staying a boolean value) because the column was initialized as a character vector. I'm not sure you can have a mix of booleans and strings.
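On the last point: an atomic vector in R holds a single type, so a data frame column built from one cannot mix booleans and strings; the logical value is coerced to character. A short sketch with made-up values shows the coercion, and that only a list can keep both types:

```r
# Combining a string and a logical in an atomic vector coerces to character
mixed <- c("Pit Bull Mix", FALSE)
print(class(mixed))  # "character"
print(mixed[2])      # "FALSE" (a string, no longer a logical)

# A list keeps each element's own type
as_list <- list("Pit Bull Mix", FALSE)
print(sapply(as_list, class))  # "character" "logical"
```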

You can use the count function from the plyr package. I've demonstrated an example using the data you provided in the question.

> library(plyr)

> df <- read.table(text = "Shetland Sheepdog Mix
  Domestic Shorthair Mix
  Pit Bull Mix
  Domestic Shorthair Mix
  Lhasa Apso/Miniature Poodle
  Cairn Terrier/Chihuahua Shorthair
  Domestic Shorthair Mix
  Domestic Shorthair Mix
  American Pit Bull Terrier Mix
  Cairn Terrier
  Domestic Shorthair Mix
  Miniature Schnauzer Mix
  Pit Bull Mix
  Yorkshire Terrier Mix
  Great Pyrenees Mix
  Domestic Shorthair Mix
  Domestic Shorthair Mix
  Pit Bull Mix
  Angora Mix
  Flat Coat Retriever Mix
  Queensland Heeler Mix
  Domestic Shorthair Mix
  Plott Hound/Boxer", sep='\n', stringsAsFactors = F, col.names = c('Breed'))

Use the plyr::count function.

> df <- count(df, 'Breed')

> df 

##                                 Breed freq
## 1       American Pit Bull Terrier Mix    1
## 2                          Angora Mix    1
## 3                       Cairn Terrier    1
## 4   Cairn Terrier/Chihuahua Shorthair    1
## 5              Domestic Shorthair Mix    8
## 6             Flat Coat Retriever Mix    1
## ...
## ...


> df$TrueFalse <- ifelse(df$freq >= 3, df$Breed, F)

> df

##                                      Breed freq                    TrueFalse
## 1            American Pit Bull Terrier Mix    1                        FALSE
## 2                               Angora Mix    1                        FALSE
## 3                            Cairn Terrier    1                        FALSE
## 4        Cairn Terrier/Chihuahua Shorthair    1                        FALSE
## 5                   Domestic Shorthair Mix    8       Domestic Shorthair Mix
## 6                  Flat Coat Retriever Mix    1                        FALSE

You can also use base R's table to get the frequencies.

new_df <- data.frame(table(df$Breed))
#                            Var1              Freq
#1        American Pit Bull Terrier Mix    1
#2                           Angora Mix    1
#3                        Cairn Terrier    1
#4    Cairn Terrier/Chihuahua Shorthair    1
#5               Domestic Shorthair Mix    8
#6              Flat Coat Retriever Mix    1
#7                   Great Pyrenees Mix    1
#8          Lhasa Apso/Miniature Poodle    1
#9              Miniature Schnauzer Mix    1
#10                        Pit Bull Mix    3
#11                   Plott Hound/Boxer    1
#12               Queensland Heeler Mix    1
#13               Shetland Sheepdog Mix    1
#14               Yorkshire Terrier Mix    1

and then use ifelse to get the value of the TrueFalse column.

new_df$TrueFalse <- ifelse(new_df$Freq > 2, as.character(new_df$Var1), "F")

#                                 Var1  Freq                TrueFalse
#1        American Pit Bull Terrier Mix    1                        F
#2                           Angora Mix    1                        F
#3                        Cairn Terrier    1                        F
#4    Cairn Terrier/Chihuahua Shorthair    1                        F
#5               Domestic Shorthair Mix    8   Domestic Shorthair Mix
#6              Flat Coat Retriever Mix    1                        F
#7                   Great Pyrenees Mix    1                        F
#8          Lhasa Apso/Miniature Poodle    1                        F
#9              Miniature Schnauzer Mix    1                        F
#10                        Pit Bull Mix    3             Pit Bull Mix
#11                   Plott Hound/Boxer    1                        F
#12               Queensland Heeler Mix    1                        F
#13               Shetland Sheepdog Mix    1                        F
#14               Yorkshire Terrier Mix    1                        F
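The as.character() call is what makes this work, and it also explains the integers in the question: a data frame built from table() stores the breed names as a factor, and ifelse() drops the factor attributes, so only the underlying integer codes come back. A minimal sketch with made-up values (the vectors below are hypothetical):

```r
# Hypothetical breed names stored as a factor, with made-up frequencies
breeds <- factor(c("Pit Bull Mix", "Angora Mix", "Pit Bull Mix"))
freq   <- c(80, 20, 80)

# Without as.character(): the factor's integer codes leak through
broken <- ifelse(freq < 50, FALSE, breeds)
print(broken)             # numbers, not breed names
print(is.numeric(broken)) # TRUE

# With as.character(): the breed names are kept
fixed <- ifelse(freq < 50, "F", as.character(breeds))
print(fixed)
```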

If we need a summarised output, then:

library(data.table)
setDT(df)[, .(Frequency = .N, TrueFalse = .N > 55), by = Breed]
