简体   繁体   中英

Subsetting nominal variables in R

I have a data set similar to this one:

x <- sample(c("A", "B", "C", "D"), 1000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05))
y <- sample(1:40, 1000, replace=TRUE)
d <- data.frame(x,y)

str(d)
'data.frame':   1000 obs. of  2 variables:
$ x: Factor w/4 levels "A","B","C","D": 1 3 3 2 3 3 3 3 4 3 ...
$ y: int  28 35 14 4 34 36 30 35 26 9 ...

table(d$x)
A   B   C   D 
115 204 637 44 

So in my real data set i have multiple thousands of these category (A, B, C, D).

The str() of my real dataset

str(realdata)
data.frame':    346340 obs. of  91 variables:
$ author       : Factor w/ 42590 levels "-jon-","--LZR--",..: 1962 3434 1241 7666 6235 2391 1196 2779 1881 339 ...
$ created_utc  : Factor w/ 343708 levels "2015-05-01 02:00:41",..: 14815 23163 2281 3569 5922 7211 15783 5512 13485 8591 ...
$ group    : Factor w/ 5 levels "xyz","abc","bnm",..: 2 2 2 2 2 2 2 2 2 2 ...
....

Now i want to subset the data, so i have only the rows of those $authors (or $x in the d dataframe) in my new dataframe that have more than 100 entries in total.

I tried the following:

dnew <- subset(realdata, table(realdata$author) > 100)

It gives me a result, but it seems the not all entries of the authors were included. Although it should be way more, i just get 1.3% of the rows of the complete dataset. I checked it manually (with excel) and it should be way more than that (approx. 30%). The manual analysis showed that 1.2 % of $author stand for 30% of the entries. So it seems he just gave me one row with the $author who has more than 100 entries, but not all of his entries.

Do you know of a way to fix this?

We can do this easily with data.table . Convert the 'data.frame' to 'data.table' ( setDT(d) , grouped by 'x', if the number of observations is greater than 100, we Subset the Data.table ( .SD )

library(data.table)
ddt <- setDT(d)[, if(.N > 100) .SD, x]

Or if we are using dplyr , the same approach can be used.

library(dplyr)
dpl <- d %>%
        group_by(x) %>%
        filter(n() > 100) %>%
        droplevels()
str(dpl)
#Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 866 obs. of  2 variables:
#$ x: Factor w/ 2 levels "B","C": 1 1 2 1 1 2 2 2 2 2 ...
# $ y: int  25 25 13 11 2 32 12 15 12 3 ...

Also, in using the base R , the table can be helpful

v1 <- table(d$x)
d1 <- subset(d, x %in% names(v1)[v1 > 100])

As the column 'x' is factor , when we subset the dataset, the levels persist, to remove that use droplevels

d2 <- droplevels(d1)

As the OP didn't set the seed , the output will be different for each person.

str(d2)
#'data.frame':   866 obs. of  2 variables:
#$ x: Factor w/ 2 levels "B","C": 1 1 2 1 1 2 2 2 2 2 ...
#$ y: int  25 25 13 11 2 32 12 15 12 3 ...

I. Data frame d with four levels

  table(d$x)

  #  A   B   C   D 
  # 92 232 630  46 

II. Checking which level has greater than 100 records

  which(table(d$x)>100)
  # B C 
  # 2 3 

III. Subsetting d data frame having only records belonging to levels which have greater than 100 records ie. level B and level C

  result <- d[ d$x %in%  names(table(d$x))[table(d$x) > 100] , ]
  dim(result)
  # [1] 862   2

  str(result)
  # 'data.frame':   862 obs. of  2 variables:
  #  $ x: Factor w/ 4 levels "A","B","C","D": 3 2 3 3 2 2 2 3 3 3 ...
  #  $ y: int  29 32 27 40 30 38 8 16 2 23 ...

Level A and D still persists with 0 records

  table(result$x)

  #   A   B   C   D 
  #   0 232 630   0 

IV. Removing the levels with 0 records using factor()

  result$x <- factor(result$x)

  str(result)
  # 'data.frame':   860 obs. of  2 variables:
  #  $ x: Factor w/ 2 levels "B","C": 2 2 1 2 1 2 2 2 1 2 ...
  #  $ y: int  29 32 27 40 30 38 8 16 2 23 ...

  table(result$x)
  #   B   C 
  # 232 630 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM