R - Frequency histogram from sampling: efficiency and more
I'm a university student, beginning to explore R for an exam. Sorry for the vague title, as I have many questions related to this post.
I've run into the problem of sampling a population of people who are either male (M) or female (F). I wanted to define a function that takes the number of males and females in this population, creates sample.number samples of size sample.size, and returns a data frame containing the sample proportions of females (over the total sample size) together with their frequencies.
I'm positive there is a simple and well-optimized way to do this, but I've written a small function that (barely) works:
senators <- function(Fem = 13,
Mal = 87,
sample.size = 10,
sample.number = 100){
pop <- c(rep("F", Fem), rep("M", Mal)) # I create the population base
popsa <- list(NA) # I make some empty variables used later
popsa.factor <- list(NA) # Not sure if this passage is even needed...
popsa.proportion <- list(NA)
Here comes a for loop. I've read that for loops are a really inefficient way to do this. Is there a better way?
for(i in 1:sample.number){
popsa[[i]] <- sample(pop, sample.size, replace = TRUE)
popsa.factor[[i]] <- table(factor(popsa[[i]], levels = c("M", "F")))
popsa.proportion[[i]] <- popsa.factor[[i]][2]/sample.size
}
I start by assigning a sample to each element of the list popsa, then I use popsa to create a table from each sample and store it in popsa.factor. Then I calculate the proportion of females over the total and store it in popsa.proportion. This for loop seems super messy to me, and it is really slow when processing lots of samples. Is there a better, more efficient way to do what I've done here?
popsa.unlisted <- unlist(popsa.proportion)
popsa.frequency <- table(popsa.unlisted)
popsa.frame <- data.frame(Level = as.numeric(names(popsa.frequency)),
Freq = as.numeric(popsa.frequency))
return(popsa.frame)
} # This closes the function call
I then unlist popsa.proportion to get every proportion in a vector, and table those values to get the frequencies, storing them in popsa.frequency. Now I try to turn the factor popsa.frequency into a data frame by cheating: I convert the names of popsa.frequency to numeric and store them as the first column of the data frame. The function then returns popsa.frame, as I wanted.
popsa.frame, though, still carries over the factor properties of popsa.frequency in its first column (Level). How can I change this? Should I?
Since these are frequencies of a sample distribution, I'd like to create a histogram from this data frame, but hist() only accepts numeric vectors, so popsa.frame isn't a valid object. plot(popsa.frame) returns more or less what I want, though. How can I create such a histogram?
Edit: Following the marked answer below, I've also figured out how to convert the data frame the function creates into an object that hist() can actually use to create a frequency histogram (although a barplot yields more or less the same graph, and is possibly a more statistically correct way to show such a result):
result <- senators(Fem=13,Mal=87,sample.size=50,sample.number=10000)
# Repeat each proportion as many times as it was observed
raw <- rep(result$Level, times = result$Freq)
hist(raw)
Your function has default values, so calling senators() with no arguments is enough to create a data.frame.
With your data, I would do:
df <- senators() # using default values
plot(df, type = "h", lwd = 5, lend = 1) # type sets the plot type, lwd the line width, and lend = 1 gives the bars square ends
Take a look at ?plot to see the types of plots you can make. You can also see how to change graphical parameters with ?par.
PS: look at this post for line width details.
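For comparison, a bar plot is another way to draw the same table of frequencies. The sketch below is only illustrative: it builds a small simulated stand-in with the same Level and Freq columns (the name df_demo and the seed are assumptions, not part of the original function) and draws it both ways.

```r
# Stand-in for the data frame returned by senators(): simulated, not real data.
set.seed(1)                                   # arbitrary seed, for reproducibility only
props <- rbinom(100, size = 10, prob = 0.13) / 10
freq  <- table(props)
df_demo <- data.frame(Level = as.numeric(names(freq)),
                      Freq  = as.numeric(freq))

# Vertical lines at each observed proportion, as in the call above.
plot(df_demo, type = "h", lwd = 5, lend = 1)

# Bar plot of the same frequencies; names.arg labels each bar with its proportion.
barplot(height = df_demo$Freq, names.arg = df_demo$Level,
        xlab = "Sample proportion of females", ylab = "Frequency")
```

The type = "h" plot keeps the x axis numeric, so gaps between proportions that never occurred stay visible, while barplot() treats each observed proportion as a category and places the bars side by side.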
Creating the lists and running the for loop are performance bottlenecks. I was able to use sapply to remove the for loop and some of the temporary variables.
I am still returning the data frame. Another option would be to return the vector answer and pass it straight to the histogram plotting function for your final plot.
senators <- function(Fem = 13,
Mal = 87,
sample.size = 10,
sample.number = 100){
pop <- c(rep("F", Fem), rep("M", Mal)) # I create the population base
answer <- sapply(1:sample.number, function(x) {
  popsa <- sample(pop, sample.size, replace = TRUE)  # draw one sample with replacement
  sum(popsa == "F") / sample.size                    # proportion of females in it
})
popsa.frequency <- table(answer)
popsa.frame <- data.frame(Level = as.numeric(names(popsa.frequency)),
Freq = as.numeric(popsa.frequency))
return(popsa.frame)
}
senators()
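Because the samples are drawn with replacement, the number of females in each sample follows a binomial distribution with probability Fem / (Fem + Mal), so the per-sample loop can be skipped altogether. The following is a minimal sketch under that assumption; the function name senators_binom is hypothetical and not part of the original code. It returns the raw vector of proportions, which can go straight to hist().

```r
# Hypothetical variant (not from the original answers): draws the female counts
# directly from a binomial distribution instead of sampling a character vector.
senators_binom <- function(Fem = 13, Mal = 87,
                           sample.size = 10, sample.number = 100) {
  p <- Fem / (Fem + Mal)                      # probability that one draw is female
  rbinom(sample.number, size = sample.size, prob = p) / sample.size
}

props <- senators_binom(sample.size = 50, sample.number = 10000)

# Same Level/Freq layout as the data frame returned by senators().
freq  <- table(props)
frame <- data.frame(Level = as.numeric(names(freq)),
                    Freq  = as.numeric(freq))
str(frame)                                    # shows the type of each column

# The raw proportions are a numeric vector, so hist() accepts them directly.
hist(props, main = "Sample proportion of females", xlab = "Proportion")
```

Since Level is built with as.numeric(), str(frame) should report it as a numeric column rather than a factor. If the sampling were done without replacement, the binomial shortcut would no longer apply.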