
r cumulative plot simplification and what should it be called

I have a numeric vector data. I need to gather the following data, i.e., a histogram, but in a cumulative sense.

a=c()
s=seq(0,1000,10)
for(i in s)
{
    a<-c(a,length(data[data>=i]))
}
plot(s,a)

How can I vectorize this, and what is this operation called? The current approach is not very good, because I have to know the range of the data in advance in order to write s above. Is there an existing function in R that does this?

Thank you.

Something like this?

set.seed(1)          # for reproducible example
data <- rnorm(100)   # random sample from N(0,1)
par(mfrow=c(1,2))    # set up graphics device for 2 plots

z <- hist(data,ylab="Counts",main="Histogram")
barplot(cumsum(z$counts), names.arg=z$breaks[-1],main="Cuml. Histogram")

This takes advantage of the fact that the hist(...) function not only plots a histogram but also returns an object of class histogram. This object has elements $breaks, containing the bin boundaries (the lower and upper limits of each bin), and $counts, containing the number of data points in each bin. The cumsum function calculates the cumulative sum. So the plot on the right is just the cumulative sum of the counts plotted against the upper bin breaks.
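
To look at the structure described above without drawing anything, one can call hist with plot = FALSE and inspect the result. This is a minimal sketch on the same simulated data; the names cum_le and check are introduced here for illustration only.

```r
set.seed(1)
data <- rnorm(100)

# plot = FALSE returns the histogram object without drawing it
z <- hist(data, plot = FALSE)

# breaks are bin edges, counts are per-bin tallies, so there is one more break than count
length(z$breaks)
length(z$counts)

# running total of the bin counts
cum_le <- cumsum(z$counts)

# with the default right-closed bins, each running total is the number of
# values at or below the corresponding upper bin edge
check <- sapply(z$breaks[-1], function(b) sum(data <= b))
all(cum_le == check)
```

The last line compares the cumulative counts against a direct count, which is a quick way to confirm what the bars in the cumulative plot represent.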

Another, slightly simpler way to do this is to "hack" the histogram object returned by hist(...) and then use plot(...) on that:

z <- hist(data,ylab="Counts",main="Histogram")
z$counts <- cumsum(z$counts)
plot(z, main="Cuml. Histogram")
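
The question counts values at or above each threshold, which runs in the opposite direction to cumsum. One possible adaptation of the same trick, offered as a sketch rather than as part of the original answer, reverses the counts before and after accumulating:

```r
set.seed(1)
data <- rnorm(100)

z <- hist(data, plot = FALSE)

# accumulate from the rightmost bin down, then restore the original bin order
z$counts <- rev(cumsum(rev(z$counts)))

# each bar now shows how many values fall in that bin or in any bin to its right
plot(z, main = "Reverse cuml. histogram")
```

The first bar equals length(data) and the heights never increase from left to right, which matches the shape produced by the loop in the question.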

Finally, ecdf(...) (empirical cumulative distribution function) returns a function that can be plotted easily:

plot(ecdf(data))
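
Because ecdf returns an ordinary function, it can also be evaluated at any set of thresholds, and those thresholds can be taken from the data so the range need not be known in advance. The sketch below turns the proportions into counts above each threshold; the names Fn, s, and n_above are assumptions made for this example. Note that it counts values strictly greater than the threshold, which differs from >= only when a data value equals a threshold exactly.

```r
set.seed(1)
data <- rnorm(100)

Fn <- ecdf(data)                                   # Fn(x) = proportion of values <= x
s  <- seq(min(data), max(data), length.out = 50)   # thresholds derived from the data range
n_above <- length(data) * (1 - Fn(s))              # number of values strictly greater than each threshold

plot(s, n_above, type = "s", xlab = "threshold", ylab = "count above threshold")
```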


I would convert to a factor with as many levels as you want bins, and then use table and cumsum on that.

For example:

# Create some fake data:
> tst = sample(1:50,10)
> tst
 [1] 33  7 13 19  1 18 39 15 21 25

# create a vector of factors with all possible levels from "min(tst)" until "max(tst)":
> tst2 = factor(as.character(tst),levels=paste0(min(tst):max(tst)))
> tst2
 [1] 33 7  13 19 1  18 39 15 21 25
39 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 ... 39

# finally, get in one (vectorized) operation the number of values <= each level:
> cumsum(table(tst2))
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
 1  1  1  1  1  1  2  2  2  2  2  2  3  3  4  4  4  5  6  6  7  7  7  7  8  8  8  8 
29 30 31 32 33 34 35 36 37 38 39 
 8  8  8  8  9  9  9  9  9  9 10 

Does this help?

Edit:

I just realized that this gives you the number of items whose value is less than or equal to a given threshold. You can obtain what you want with:

# reverse the table first so the cumulative sum runs from the largest level down,
# then reverse back; the names stay attached to the right counts
> tst3 = rev(cumsum(rev(table(tst2))))
> tst3
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
10  9  9  9  9  9  9  8  8  8  8  8  8  7  7  6  6  6  5  4  4  3  3  3  3  2  2  2 
29 30 31 32 33 34 35 36 37 38 39 
 2  2  2  2  2  1  1  1  1  1  1 

Edit 2:

Much simpler in fact:

> sapply(min(tst):max(tst), function(x)sum(tst>=x))
 [1] 10  9  9  9  9  9  9  8  8  8  8  8  8  7  7  6  6  6  5  4  4  3  3  3  3  2
[27]  2  2  2  2  2  2  2  1  1  1  1  1  1
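
The sapply call still loops over the thresholds internally. One fully vectorized alternative, shown here as a sketch and not as part of the original answer, compares every value with every threshold at once using outer and then sums the columns. The vector tst repeats the sample printed above, and the names thresholds and counts are chosen for this example.

```r
tst <- c(33, 7, 13, 19, 1, 18, 39, 15, 21, 25)   # the sample shown above
thresholds <- min(tst):max(tst)                  # range taken from the data itself

# outer() builds a length(tst) x length(thresholds) logical matrix
# whose [i, j] entry is tst[i] >= thresholds[j];
# colSums() then counts the TRUE entries for each threshold
counts <- colSums(outer(tst, thresholds, ">="))
counts
```

The intermediate matrix has length(tst) * length(thresholds) entries, so for large inputs the sapply version or an ecdf-based approach may be lighter on memory.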
