I've got a dataset representing 50,000 simulations. Each simulation has multiple scenario IDs, and associated with each scenario ID is a second identifier called the target ID. The first four simulations might look like the following:
+------------+-------------+-----------+
| SIMULATION | SCENARIO ID | TARGET ID |
+------------+-------------+-----------+
|          1 |          12 |        11 |
|          1 |          10 |         2 |
|          1 |           1 |        18 |
|          2 |           3 |         9 |
|          2 |           7 |        10 |
|          2 |          21 |         2 |
|          3 |          17 |        15 |
|          3 |          12 |         9 |
|          4 |           7 |        16 |
+------------+-------------+-----------+
I want to sample this 50,000-simulation set down to a 10,000-simulation set, while retaining the best possible representation of the full set with respect to the frequency of each scenario/target combination.
I've tried stratified sampling using the stratified function from the splitstackshape package, grouping by scenario ID and target ID. However, I can only specify the sample size for each group.
I can play with the proportion sampled from each group until the total gets close to 10,000 simulations, but that's not ideal, as I need this to be as automated as possible.
If it is not too late, may I propose the following solution.
First, load the library and generate a dataset (of course, there is no need to generate a dataset in your case):
library(data.table)
# Generate dataset ...
df = data.table(Simulation = sample(1:4, 60, replace = TRUE),
                Scenario.ID = sample(1:5, 60, replace = TRUE),
                Target.ID = sample(1:2, 60, replace = TRUE))
# ... and sort it
df = df[order(Simulation, Scenario.ID, Target.ID)]
Second, define the decreasing ratio. In this example I use n = 3; in your case it will be n = 5 (50,000 / 10,000) or any other number that fits the goal.
n = 3
Third, define the number of rows to be taken from each combination of scenario and target. The counts are rounded because they must be integers. If a rounded count is zero, 1 is taken instead, so that every combination of scenario and target stays represented in the sample.
group.sample = df[, .N, by = .(Scenario.ID, Target.ID)][, pmax(round(N/n), 1)]
group.sample
[1] 1 2 2 2 2 2 3 2 3 1
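As a quick sanity check (an extra step, not part of the original answer), the total implied sample size can be compared with the target share. It will be close to nrow(df) / n, possibly slightly larger because of the rounding up to at least 1 row per group:

```r
# Total rows the sample will contain -- roughly a 1/n share of df,
# plus any groups that were rounded up from 0 to 1
sum(group.sample)
nrow(df) / n
```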
Fourth, mark the records to be taken into the sample (thanks to this answer). I use set.seed to make the example reproducible; the selection within each group is random.
set.seed(1)
df[, Sample := 1:.N %in% sample(.N, min(.N, group.sample[.GRP])), by = .(Scenario.ID, Target.ID)]
head(df[order(Simulation, Scenario.ID, Target.ID)])
Simulation Scenario.ID Target.ID Sample
1: 1 1 1 FALSE
2: 1 1 1 TRUE
3: 1 1 2 FALSE
4: 1 2 1 FALSE
5: 1 2 2 FALSE
6: 1 3 1 FALSE
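The down-sampled dataset itself (an extra step, not in the original answer) is simply the marked rows:

```r
# Keep only the marked rows; this is the down-sampled dataset
df.sampled = df[Sample == TRUE]
nrow(df.sampled)   # about nrow(df) / n
```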
Fifth, compare the original proportion of each scenario/target combination with the sampled one. The proportions are rounded to two decimal places.
df[, .(Original = round(.N / nrow(df), 2),
       Sampled = round(sum(Sample) / df[Sample == TRUE, .N], 2)),
   by = .(Scenario.ID, Target.ID)]
Scenario.ID Target.ID Original Sampled
1: 1 1 0.07 0.05
2: 1 2 0.10 0.10
3: 2 1 0.10 0.10
4: 2 2 0.08 0.10
5: 3 1 0.12 0.10
6: 4 1 0.08 0.10
7: 4 2 0.15 0.15
8: 5 1 0.08 0.10
9: 3 2 0.17 0.15
10: 5 2 0.05 0.05
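Since the question asks for an automated process, the steps above can be wrapped into a single function that derives the ratio n from the desired sample size (for the 50,000 -> 10,000 case, n = 5). This is a sketch; the function name sample_strata and the target.rows argument are illustrative, not from the original answer:

```r
library(data.table)

# Sketch: stratified down-sampling to approximately `target.rows` rows,
# keeping at least one row per Scenario.ID / Target.ID combination.
# Note: adds a Sample column to `df` by reference.
sample_strata = function(df, target.rows, seed = 1) {
  n = nrow(df) / target.rows          # decreasing ratio, e.g. 50000 / 10000 = 5
  group.sample = df[, .N, by = .(Scenario.ID, Target.ID)][, pmax(round(N / n), 1)]
  set.seed(seed)
  df[, Sample := 1:.N %in% sample(.N, min(.N, group.sample[.GRP])),
     by = .(Scenario.ID, Target.ID)]
  df[Sample == TRUE]
}

# Usage: down-sample the example data to roughly 20 rows
df = data.table(Simulation = sample(1:4, 60, replace = TRUE),
                Scenario.ID = sample(1:5, 60, replace = TRUE),
                Target.ID = sample(1:2, 60, replace = TRUE))
sampled = sample_strata(df, target.rows = 20)
```

The exact size of the result will drift slightly from target.rows because of the rounding and the at-least-one-row floor, but every scenario/target combination present in the input is guaranteed to appear in the output.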