How to generate summary statistics (similar to psych::describeBy) for subgroups of subgroups, within a larger dataset?

New to R (for biostats) here! I have a huge dataset, and am using describe() and describeBy() from the psych package. But I'm also trying to find a way to do basic stats for subgroups within subgroups.

For example, my dataset is about membership within a club, which has Chinese and Indian members. Other variables include gender, age, height, weight, BMI, etc.

I have figured out how to use psych::describeBy to look at means and standard deviations for subgroups defined by one variable, e.g. ethnicity, but I can't figure out how to narrow this down further so that I generate a summary only for Chinese male members.

I tried redefining the dataset using the subset() function, and then running describeBy again, e.g.

chinese <- subset(maindata, chinese=1)
describeBy(chinese, male=1)

But this didn't work, and the results were the same as describeBy(maindata,chinese=1), rather than the Chinese male subset.

I hope that makes sense.

The only other solution I can think of is to break down my main dataset into smaller ones in MS Excel and re-upload each separately (e.g. Chinese.xls, Indian.xls), or to create a new variable defined by a combination of ethnicity and gender, e.g. Chinesemale=1, Chinesefemale=2, Indianmale=3, Indianfemale=4.

I will more or less need to run t-tests and Fisher's exact tests on these subgroups of subgroups, so any good package recommendations that would help with these would be appreciated!

Thanks in advance!!

Sample Data

df1 <- data.frame(subject = c(1, 2, 3, 4, 5),
                  chinese = c(1, 1, 1, 0, 0),
                  male = c(1, 0, 1, 0, 1),
                  value = c(45, 23, 84, 11, 12))

Two changes in syntax from your code:

  • Use a double equal sign in subset(). You want to keep rows where chinese is equal to 1, which is a comparison. A single equal sign would instead pass a value of 1 to a parameter called chinese, so no rows get filtered.
  • In describeBy(), the group parameter gives you separate summary statistics for each category in that column (as shown below). You can't use it to subset for male=1.

chinese <- subset(df1, chinese == 1)

describeBy(chinese, group = "male")

 Descriptive statistics by group 
group: 0
        vars n mean sd median trimmed mad min max range skew kurtosis se
subject    1 1    2 NA      2       2   0   2   2     0   NA       NA NA
chinese    2 1    1 NA      1       1   0   1   1     0   NA       NA NA
male       3 1    0 NA      0       0   0   0   0     0   NA       NA NA
value      4 1   23 NA     23      23   0  23  23     0   NA       NA NA
------------------------------------------------------------------------------------------------------------------------------------- 
group: 1
        vars n mean    sd median trimmed   mad min max range skew kurtosis   se
subject    1 2  2.0  1.41    2.0     2.0  1.48   1   3     2    0    -2.75  1.0
chinese    2 2  1.0  0.00    1.0     1.0  0.00   1   1     0  NaN      NaN  0.0
male       3 2  1.0  0.00    1.0     1.0  0.00   1   1     0  NaN      NaN  0.0
value      4 2 64.5 27.58   64.5    64.5 28.91  45  84    39    0    -2.75 19.5
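To see why the single equal sign fails silently, a quick base R check on the same df1 sample data may help. This is only a sketch: the names all_rows and chinese_only are made up for illustration. With `=`, the expression `chinese = 1` is passed along as a stray named argument rather than as a row condition, so subset() keeps every row.

```r
df1 <- data.frame(subject = c(1, 2, 3, 4, 5),
                  chinese = c(1, 1, 1, 0, 0),
                  male = c(1, 0, 1, 0, 1),
                  value = c(45, 23, 84, 11, 12))

# Single "=": no row condition is supplied, so nothing is filtered out
all_rows <- subset(df1, chinese = 1)
nrow(all_rows)      # 5

# Double "==": a logical test evaluated for each row
chinese_only <- subset(df1, chinese == 1)
nrow(chinese_only)  # 3
```

Checking nrow() after each subset() call is a cheap way to confirm that a filter actually did something.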

If you only want to see the summary stats for males in the sample, you could add & male == 1 to the subset():

chinese <- subset(df1, chinese == 1 & male == 1)

describe(chinese)  # the data is already filtered, so no grouping variable is needed

        vars n mean    sd median trimmed   mad min max range skew kurtosis   se
subject    1 2  2.0  1.41    2.0     2.0  1.48   1   3     2    0    -2.75  1.0
chinese    2 2  1.0  0.00    1.0     1.0  0.00   1   1     0  NaN      NaN  0.0
male       3 2  1.0  0.00    1.0     1.0  0.00   1   1     0  NaN      NaN  0.0
value      4 2 64.5 27.58   64.5    64.5 28.91  45  84    39    0    -2.75 19.5
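If you would rather see all four ethnicity-by-gender combinations at once instead of subsetting one at a time, one possible approach is to group by both columns together. The sketch below uses base R's aggregate() for the count, mean and SD of value. The name by_both is an assumption for illustration. The psych call at the end passes a list of grouping variables to describeBy(), and it only runs if psych is installed.

```r
df1 <- data.frame(subject = c(1, 2, 3, 4, 5),
                  chinese = c(1, 1, 1, 0, 0),
                  male = c(1, 0, 1, 0, 1),
                  value = c(45, 23, 84, 11, 12))

# Base R: n, mean and SD of value for every chinese x male combination.
# Groups with a single member get an SD of NA.
by_both <- aggregate(value ~ chinese + male, data = df1,
                     FUN = function(v) c(n = length(v), mean = mean(v), sd = sd(v)))
print(by_both)

# With psych: a list of grouping variables gives one describe() table
# per combination of chinese and male.
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describeBy(df1, group = list(df1$chinese, df1$male)))
}
```

This avoids creating a separate combined variable (Chinesemale=1, Chinesefemale=2, and so on) or splitting the file in Excel, since the grouping is done on the fly.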
