
R function/method to sample a data frame using probability until a condition is reached

I have a data frame with 3 columns:

OBJECTID: the unique identifier of a polygon (or row)
AvgWTRisk: probability (0-1) of a disturbance in a forest; ~0.11 is the highest value
HA: area of a polygon in the forest

I want to develop a function that creates a random sample from the data frame, based on the probability value. Here's an example of the data structure:

Data

      OBJECTID AvgWTRisk        HA
32697    32697 0.0008456 7.7465000
36480    36480 0.0050852 7.9329797
13805    13805 0.0173463 0.7154995
38796    38796 0.0026580 0.2882192
8494      8494 0.0089310 6.4686595
23609    23609 0.0090647 6.1246000

dput output

structure(list(OBJECTID = c(32697L, 36480L, 13805L, 38796L, 8494L, 
23609L), AvgWTRisk = c(0.0008456, 0.0050852, 0.0173463, 0.002658, 
0.008931, 0.0090647), HA = c(7.7465, 7.9329797, 0.7154995, 0.2882192, 
6.4686595, 6.1246)), row.names = c(32697L, 36480L, 13805L, 38796L, 
8494L, 23609L), class = "data.frame")

I am attempting to do this using the sample() function in R.

Is there any way to use the sum of area as my 'size = ' target, as opposed to a number of rows, like this:

Landscape_WTDisturbed <- Landscape_WTRisk[sample(1:nrow(Landscape_WTRisk),
                                                 size = sum(HA >= 100*0.95 && HA <= 100*1.05),
                                                 prob = WTProb, replace = FALSE),]

where WTProb is a vector of AvgWTRisk, i.e. 'WTProb <- as.vector(Landscape_WTRisk$AvgWTRisk)', and HA is the area column from the data frame.

The sample selection above gives me a data frame with all of the columns but no rows.
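One likely reason for the empty result, sketched on the six example rows: sum() applied to a logical test counts how many individual polygons fall in the 95-105 ha range, and it does not add up areas. No single polygon is that large, so the requested size is zero. The sketch uses the element-wise `&`; with `&&`, older R versions compared only the first elements (also giving zero here), while R 4.3.0 and later raise an error for vectors longer than one. The names `in_range`, `n_rows_requested` and `empty_sample` are illustrative only.

```r
# Example data copied from the dput output in the question.
Landscape_WTRisk <- data.frame(
  OBJECTID  = c(32697L, 36480L, 13805L, 38796L, 8494L, 23609L),
  AvgWTRisk = c(0.0008456, 0.0050852, 0.0173463, 0.002658, 0.008931, 0.0090647),
  HA        = c(7.7465, 7.9329797, 0.7154995, 0.2882192, 6.4686595, 6.1246)
)
HA <- Landscape_WTRisk$HA

# Element-wise test: is each individual polygon between 95 and 105 ha?
in_range <- HA >= 100 * 0.95 & HA <= 100 * 1.05

# sum() of a logical vector counts the TRUE values; it does not total the area.
n_rows_requested <- sum(in_range)

# Requesting zero rows returns a data frame with all columns but no rows.
empty_sample <- Landscape_WTRisk[sample(seq_len(nrow(Landscape_WTRisk)),
                                        size = n_rows_requested,
                                        prob = Landscape_WTRisk$AvgWTRisk,
                                        replace = FALSE), ]
```

So `size` has to stay a row count; the area target needs to be enforced separately, after or during the draw.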

As opposed to:

Landscape_WTDisturbed <- Landscape_WTRisk[sample(1:nrow(Landscape_WTRisk),
                                                 size = 10,
                                                 prob = WTProb, replace = FALSE),]

This works and gives a sample of 10 rows. However, I have no control over the total area being selected.

Should I try to achieve this with a while loop, where the summed area of the selected rows is the criterion, and a small selection of rows is added incrementally until the target is reached?
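A minimal sketch of the while-loop idea described above, assuming one polygon is drawn per iteration, weighted by AvgWTRisk, until the summed area reaches the target. The function `sample_until_area` and its arguments are hypothetical names introduced for illustration; the data are the six example rows from the question.

```r
# Example data copied from the dput output in the question.
Landscape_WTRisk <- data.frame(
  OBJECTID  = c(32697L, 36480L, 13805L, 38796L, 8494L, 23609L),
  AvgWTRisk = c(0.0008456, 0.0050852, 0.0173463, 0.002658, 0.008931, 0.0090647),
  HA        = c(7.7465, 7.9329797, 0.7154995, 0.2882192, 6.4686595, 6.1246)
)

# Hypothetical helper (not from the original post): draw polygons one at a
# time, weighted by risk, until the summed area reaches `target_area`.
sample_until_area <- function(df, target_area,
                              prob_col = "AvgWTRisk", area_col = "HA") {
  remaining <- seq_len(nrow(df))  # row indices not yet drawn
  picked    <- integer(0)         # row indices drawn so far
  total     <- 0                  # running area of the drawn rows

  while (total < target_area && length(remaining) > 0) {
    weights <- df[[prob_col]][remaining]
    if (all(weights <= 0)) break  # nothing left that can be drawn
    # Draw a position within `remaining`; sample.int() avoids the
    # sample(x) surprise when `remaining` has length one.
    pos       <- sample.int(length(remaining), size = 1, prob = weights)
    picked    <- c(picked, remaining[pos])
    total     <- total + df[[area_col]][remaining[pos]]
    remaining <- remaining[-pos]
  }
  df[picked, , drop = FALSE]
}

set.seed(1)
disturbed <- sample_until_area(Landscape_WTRisk, target_area = 15)
sum(disturbed$HA)  # at least 15, overshooting by less than the last polygon's area
```

The loop stops as soon as the target is reached, so the total can exceed it by up to one polygon's area. Keeping the result inside a band such as 95-105% of the target would need an extra step, for example rejecting the last draw and trying another polygon.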

Thank you in advance!

I hope I understand what you are asking. The following code first creates a permutation of your data in such a way that rows with higher AvgWTRisk tend to end up closer to the top of the table. In a second step, it selects the rows in the middle of the table whose cumulative sum of HA falls within a given range.

set.seed(123)
WTProb <- Landscape_WTRisk$AvgWTRisk
Landscape_WTDisturbed <- Landscape_WTRisk[sample(1:nrow(Landscape_WTRisk),
                                                 size = nrow(Landscape_WTRisk),
                                                 prob = WTProb, replace = FALSE),]
Landscape_WTDisturbed$HA.sum = cumsum(Landscape_WTDisturbed$HA)
HA.sum.min = 10
HA.sum.max = 25
Landscape_WTDisturbed = Landscape_WTDisturbed[
    Landscape_WTDisturbed$HA.sum >= HA.sum.min &
    Landscape_WTDisturbed$HA.sum <= HA.sum.max,]
Landscape_WTDisturbed
##       OBJECTID AvgWTRisk        HA   HA.sum
## 23609    23609 0.0090647 6.1246000 14.77308
## 38796    38796 0.0026580 0.2882192 15.06130
## 32697    32697 0.0008456 7.7465000 22.80780

I've attempted it as follows:

WTProb <- Landscape_WTRisk$AvgWTRisk
Landscape_WTDisturbed <- Landscape_WTRisk[sample(1:nrow(Landscape_WTRisk),
                                                 size = 1000,
                                                 prob = WTProb, replace = FALSE),]
Landscape_WTDisturbed$HA.sum = cumsum(Landscape_WTDisturbed$HA)

Landscape_WTDisturbed <- Landscape_WTDisturbed[Landscape_WTDisturbed$HA.sum<=DisturbanceArea*1.05,]

This uses cumsum to accumulate the values of the HA column and then selects the rows whose running total stays within the 'target'. I can confirm that this approach, derived from the one recommended by BigFinger (thank you), does produce appropriate results. See below:

1) The distribution of risk in the full data set

summary(Landscape_WTRisk$AvgWTRisk)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000286 0.0013508 0.0030834 0.0061175 0.0072636 0.121604

2) The distribution of risk in the sample

summary(Landscape_WTDisturbed$AvgWTRisk)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.002977 0.006563 0.010800 0.014997 0.015196 0.045924

As you can tell, the probability weighting influenced the initial sample of 1000: the sampled rows have substantially higher AvgWTRisk than the original dataset as a whole.

This approach would not work if more than 1000 sampled rows were needed to reach the target cumulative sum. I am still not sure how to make it work more dynamically: if the 'DisturbanceArea' target were to grow beyond what the 1000-row sample can meet, this approach would fall apart.
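One possible way around the fixed cap of 1000, sketched under assumptions: draw a weighted permutation of every row (size = nrow(df)), as in BigFinger's answer, and then cut on the cumulative area. The function `sample_area_cumsum` and its `tolerance` argument are hypothetical names; the sketch also assumes every AvgWTRisk is strictly positive, since sample.int() without replacement fails when fewer positive probabilities exist than rows requested.

```r
# Example data copied from the dput output in the question.
Landscape_WTRisk <- data.frame(
  OBJECTID  = c(32697L, 36480L, 13805L, 38796L, 8494L, 23609L),
  AvgWTRisk = c(0.0008456, 0.0050852, 0.0173463, 0.002658, 0.008931, 0.0090647),
  HA        = c(7.7465, 7.9329797, 0.7154995, 0.2882192, 6.4686595, 6.1246)
)

# Hypothetical helper: one weighted permutation of all rows, then a cut
# on the running area total.
sample_area_cumsum <- function(df, target_area, tolerance = 0.05) {
  # size = nrow(df) draws every row exactly once, so no fixed cap is needed.
  # Assumption: all AvgWTRisk values are > 0.
  ord <- sample.int(nrow(df), size = nrow(df), prob = df$AvgWTRisk)
  out <- df[ord, , drop = FALSE]
  out$HA.sum <- cumsum(out$HA)
  # Keep rows while the running total stays within target * (1 + tolerance).
  out[out$HA.sum <= target_area * (1 + tolerance), , drop = FALSE]
}

set.seed(123)
DisturbanceArea <- 15
Landscape_WTDisturbed <- sample_area_cumsum(Landscape_WTRisk, DisturbanceArea)
```

Because the cut is on the running total, the selected area never exceeds the upper bound, but it can fall short of a lower bound such as 95% of the target when a large polygon would jump past the band. Checking the final `HA.sum` and redrawing in that case is one option.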
