
How to reshape data frame from a row level to person level in R

I have the following code for a Netflix experiment that reduces the price of Netflix to see whether people watch more or less TV. Each time someone uses Netflix, it shows what they watched and how long they watched it for.

library(tidyverse)

sample_size <- 10000
set.seed(853)

viewing_data <-
  tibble(
    unique_person_id = sample(x = c(1:100),
                              size = sample_size,
                              replace = TRUE),
    tv_show = sample(x = c("Broadchurch", "Duty-Shame", "Drive to Survive", "Shetland", "The Crown"),
                     size = sample_size,
                     replace = TRUE)
  )

I then want to write some code that randomly assigns people to one of two groups: treatment and control. However, the dataset is at the row level, with 10,000 observations. I want to change it to the person level in R so that I can assign each person to be either treated or not. A person should not be both treated and not treated, but tv_show appears many times for each person. Does anyone know how to reshape the dataset in this case?

library(dplyr)
treatment <- viewing_data %>% 
  distinct(unique_person_id) %>% 
  mutate(treated = sample(c("yes", "no"), size = 100, replace = TRUE))

viewing_data %>% 
  left_join(treatment, by = "unique_person_id")

You can change the way of sampling if you need to...
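For instance, one way to vary the sampling is to avoid hard-coding the number of people and to state the assignment probabilities explicitly. This is only a minimal sketch: the `prob` values and the name `assigned` are illustrative assumptions, not part of the answer above.

```r
library(dplyr)
library(tibble)

# Recreate the example data from the question
set.seed(853)
viewing_data <- tibble(
  unique_person_id = sample(1:100, size = 10000, replace = TRUE),
  tv_show = sample(c("Broadchurch", "Duty-Shame", "Drive to Survive",
                     "Shetland", "The Crown"),
                   size = 10000, replace = TRUE)
)

# One row per person; n() adapts to however many distinct ids exist
treatment <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(treated = sample(c("yes", "no"), size = n(), replace = TRUE,
                          prob = c(0.5, 0.5)))  # assumed 50/50 probabilities

# Attach each person's single status to every one of their rows
assigned <- viewing_data %>%
  left_join(treatment, by = "unique_person_id")
```

Changing `prob` to, say, `c(0.3, 0.7)` would make treatment less likely than control, while each person still receives exactly one status.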

You can do the following, which groups your observations by person id and assigns a single "treated"/"control" label to each group:

library(dplyr)
viewing_data %>% 
  group_by(unique_person_id) %>% 
  mutate(group = sample(c("treated", "control"), 1))

# A tibble: 10,000 x 3
# Groups:   unique_person_id [100]
   unique_person_id tv_show          group  
              <int> <chr>            <chr>  
 1                9 Drive to Survive control
 2               64 Shetland         treated
 3               90 The Crown        treated
 4               93 Drive to Survive treated
 5               17 Duty-Shame       treated
 6               29 The Crown        control
 7               84 Broadchurch      control
 8               83 The Crown        treated
 9                3 The Crown        control
10               33 Broadchurch      control
# … with 9,990 more rows

We can check our results. Every id has exactly one group, either treated or control:

newdata <- viewing_data %>% 
    group_by(unique_person_id) %>% 
    mutate(group=sample(c("treated","control"),1))

tapply(newdata$group,newdata$unique_person_id,n_distinct)
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
 21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
 41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
 61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
 81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
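Note that drawing each person's label independently, as above, does not guarantee equally sized groups. A rough way to see the split is to count people (not rows) per group. The name `group_sizes` below is an illustrative assumption.

```r
library(dplyr)
library(tibble)

# Recreate the example data from the question
set.seed(853)
viewing_data <- tibble(
  unique_person_id = sample(1:100, size = 10000, replace = TRUE),
  tv_show = sample(c("Broadchurch", "Duty-Shame", "Drive to Survive",
                     "Shetland", "The Crown"),
                   size = 10000, replace = TRUE)
)

# Same assignment as above: one random label per person
newdata <- viewing_data %>%
  group_by(unique_person_id) %>%
  mutate(group = sample(c("treated", "control"), 1))

# Collapse to one row per person, then count people in each group
group_sizes <- newdata %>%
  ungroup() %>%
  distinct(unique_person_id, group) %>%
  count(group)

group_sizes
```

The `ungroup()` call matters here: without it, `count()` would still be computed separately for every person.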

In case you wanted random and equal allocation of persons into the two groups (complete random allocation), you can use the following code.

library(dplyr)

Persons <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(group=sample(100),  # in case the ids are not truly random
         group=ifelse(group %% 2 == 0, 0, 1))  # works if only two groups
Persons

# A tibble: 100 x 2
   unique_person_id group
              <int> <dbl>
 1                1     0
 2                2     0
 3                3     1
 4                4     0
 5                5     1
 6                6     1
 7                7     1
 8                8     0
 9                9     1
10               10     0
# ... with 90 more rows

And to check that we've got 50 in each group:

Persons %>% count(group)

# A tibble: 2 x 2
  group     n
  <dbl> <int>
1     0    50
2     1    50
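If you would rather have labelled groups and not rely on there being exactly 100 people, a possible variant is to build a balanced vector of labels and shuffle it. This is a sketch under those assumptions; `persons_labelled` and the label names are hypothetical.

```r
library(dplyr)
library(tibble)

# Recreate the example data from the question
set.seed(853)
viewing_data <- tibble(
  unique_person_id = sample(1:100, size = 10000, replace = TRUE),
  tv_show = sample(c("Broadchurch", "Duty-Shame", "Drive to Survive",
                     "Shetland", "The Crown"),
                   size = 10000, replace = TRUE)
)

persons_labelled <- viewing_data %>%
  distinct(unique_person_id) %>%
  # rep() alternates the two labels up to the number of people;
  # sample() without a size argument then shuffles that vector
  mutate(group = sample(rep(c("treated", "control"), length.out = n())))

# Check: group sizes differ by at most one person
persons_labelled %>% count(group)
```

With an odd number of people, the extra person ends up in whichever label `rep()` emits first, here "treated".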

You could also use the randomizr package, which has many more features apart from complete random allocation.

library(randomizr)

Persons <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(group=complete_ra(N=100, m=50))

Persons %>% count(group) # Check

To link this back to the viewing_data, use inner_join.

viewing_data %>% inner_join(Persons, by="unique_person_id")

# A tibble: 10,000 x 3
   unique_person_id tv_show          group
              <int> <chr>            <int>
 1               10 Shetland             1
 2               95 Broadchurch          0
 3                7 Duty-Shame           1
 4               68 Drive to Survive     0
 5               17 Drive to Survive     1
 6               70 Shetland             0
 7               78 Drive to Survive     0
 8               21 Broadchurch          1
 9               80 The Crown            0
10               70 Shetland             0
# ... with 9,990 more rows
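Since the question asks for person-level data, the joined table can also be collapsed back to one row per person, for example with the number of viewing events per person. This is a minimal sketch; the names `person_level` and `n_views` are illustrative assumptions.

```r
library(dplyr)
library(tibble)

# Recreate the example data from the question
set.seed(853)
viewing_data <- tibble(
  unique_person_id = sample(1:100, size = 10000, replace = TRUE),
  tv_show = sample(c("Broadchurch", "Duty-Shame", "Drive to Survive",
                     "Shetland", "The Crown"),
                   size = 10000, replace = TRUE)
)

# Balanced 0/1 allocation, as in the answer above
Persons <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(group = sample(n()),
         group = ifelse(group %% 2 == 0, 0, 1))

# One row per person: their group and how many rows (viewing events) they have
person_level <- viewing_data %>%
  inner_join(Persons, by = "unique_person_id") %>%
  count(unique_person_id, group, name = "n_views")

person_level
```

Because each person has a single group, counting by `unique_person_id` and `group` together still yields exactly one row per person.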
