R - Frequency histogram from sampling: efficiency and more

I'm a university student beginning to explore R for an exam. Sorry for the vague title; I have several related questions in this post.

I've run into the problem of sampling a population of people who are either Male (M) or Female (F). I want to define a function that takes the number of males and females in this population, draws sample.number samples of size sample.size, and returns a data frame containing the sample proportions of females (out of the total sample size) together with their frequencies.

I'm sure there is a simple and well-optimized way to do this, but I've written a small function that (barely) works:

senators <- function(Fem = 13, 
                 Mal = 87, 
                 sample.size = 10, 
                 sample.number = 100){

pop <- c(rep("F", Fem), rep("M", Mal)) # I create the population base

popsa <- list(NA)           # I make some empty variables used later
popsa.factor <- list(NA)    # Not sure if this passage is even needed...
popsa.proportion <- list(NA)

Here comes a for loop. I've read that for loops are a really inefficient way to do this. Is there a better way?

for(i in 1:sample.number){
  popsa[[i]] <- sample(pop, sample.size, replace = TRUE)
  popsa.factor[[i]] <- table(factor(popsa[[i]], levels = c("M", "F")))
  popsa.proportion[[i]] <- popsa.factor[[i]][2]/sample.size
  }

I start by assigning a sample to each element of the list popsa, then I use popsa to create a table from each sample and store it in popsa.factor. Then I calculate the proportion of females over the total and store it in popsa.proportion. This for loop seems super messy to me, and it is really slow when processing lots of samples. Is there a better, more efficient way to do what I've done here?

popsa.unlisted <- unlist(popsa.proportion)
popsa.frequency <- table(popsa.unlisted)

popsa.frame <- data.frame(Level = as.numeric(names(popsa.frequency)), 
                          Freq =  as.numeric(popsa.frequency))
return(popsa.frame)
} # This closes the function call

I then unlist popsa.proportion to get every proportion in a vector, and table those values to get the frequencies, storing them in popsa.frequency. Now I try to turn the factor popsa.frequency into a data frame, by cheating: I convert the names of popsa.frequency to numeric and store them as the first column of the data frame. The function then returns popsa.frame, as I wanted.

popsa.frame, though, still carries over the factor properties of popsa.frequency in its first column (Level). How can I change this? Should I?

Since these are frequencies of a sample distribution, I'd like to create a histogram from this data frame. However, hist() only accepts numeric vectors, so popsa.frame isn't a valid input. plot(popsa.frame) returns more or less what I want, though. How can I create such a histogram?

Edit: Following the marked answer below, I've also worked out how to convert the data frame the function creates into an object that hist() can actually use to create a frequency histogram (although a barplot yields more or less the same graph, and may be a more statistically correct way to show such a result):

result <- senators(Fem=13,Mal=87,sample.size=50,sample.number=10000)

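# Repeat each proportion by its frequency to rebuild the raw vector of sample
# proportions (the sapply wrapper ignored its argument and duplicated the data)
raw <- rep(result$Level, result$Freq)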

hist(raw)
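
The edit above notes that a bar plot gives roughly the same picture. A minimal sketch of that alternative, assuming a frequency table shaped like the one senators() returns; the data frame `freq` below is made-up illustrative data, not real output from the function:

```r
# Hypothetical frequency table with the same columns as senators() output
freq <- data.frame(Level = c(0, 0.1, 0.2, 0.3),
                   Freq  = c(25, 38, 24, 13))

# One bar per observed proportion; barplot() returns the bar midpoints
mids <- barplot(freq$Freq,
                names.arg = freq$Level,
                xlab = "Proportion of females in sample",
                ylab = "Frequency")
```

Because the proportions can only take a few discrete values, a bar per value avoids the bin-width choices that hist() makes on its own.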

Your function has default values for all its arguments, so calling senators() on its own already creates a data.frame.

With your data, I would do:

df <- senators() # using default values
# type = "h" draws vertical lines, lwd sets the line width,
# and lend = 1 gives the lines square ends so they look like bars
plot(df, type = "h", lwd = 5, lend = 1)

Take a look at ?plot to see the types of plots you can make. You can also see how to change graphical parameters with ?par.

PS: look at this post for line width details.

Creating the lists and running the for loop causes some performance bottlenecks. I was able to use sapply to remove the for loop and some of the temporary variables.

I am still returning the data frame. Another option would be to return the vector answer and pass it straight to the histogram plotting function for your final plot.

senators <- function(Fem = 13, 
                     Mal = 87, 
                     sample.size = 10, 
                     sample.number = 100){

  pop <- c(rep("F", Fem), rep("M", Mal)) # I create the population base

  # For each sample, draw with replacement and compute the proportion of females
  answer <- sapply(1:sample.number, function(x) {
    popsa <- sample(pop, sample.size, replace = TRUE)
    length(popsa[popsa == "F"]) / sample.size
  })

  popsa.frequency <- table(answer)

  popsa.frame <- data.frame(Level = as.numeric(names(popsa.frequency)),
                            Freq  = as.numeric(popsa.frequency))
  return(popsa.frame)
}

senators()   
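
The vector-returning option mentioned above might look like the sketch below. The function names `props_by_sampling` and `props_by_binomial` are hypothetical, chosen only for illustration. The second function swaps in a different technique that the answer does not use: since each draw with replacement is independently "F" with probability Fem / (Fem + Mal), the count of females per sample follows a binomial distribution, so rbinom() can generate all counts in one vectorized call without building the population at all.

```r
# Hypothetical variant of senators() that returns the raw vector of proportions
props_by_sampling <- function(Fem = 13, Mal = 87,
                              sample.size = 10, sample.number = 100) {
  pop <- c(rep("F", Fem), rep("M", Mal))
  sapply(seq_len(sample.number), function(i) {
    # mean() of a logical vector is the share of TRUE values
    mean(sample(pop, sample.size, replace = TRUE) == "F")
  })
}

# Swapped-in technique: draw the female counts directly from a binomial
props_by_binomial <- function(Fem = 13, Mal = 87,
                              sample.size = 10, sample.number = 100) {
  p <- Fem / (Fem + Mal)
  rbinom(sample.number, size = sample.size, prob = p) / sample.size
}

set.seed(42)                      # only to make the run repeatable
props <- props_by_binomial(sample.size = 50, sample.number = 10000)
freq  <- table(props)             # frequencies, if the table is still wanted
```

Either vector can be passed directly to hist(props), and the frequency data frame can still be rebuilt from table(props) when needed.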
