How to generate summary statistics (similar to psych::describeBy) for subgroups of subgroups, within a larger dataset?
New to R (for biostats) here! I have a huge dataset, and am using describe() and describeBy() from the psych package. But I'm also trying to find a way to do basic stats for subgroups within subgroups.
For example, my dataset is about membership within a club, which has Chinese and Indian members. Other variables include gender, age, height, weight, BMI, etc.
I have figured out how to use psych::describeBy to look at means and standard deviations for subgroups defined by one variable, e.g. ethnicity, but I can't figure out how to narrow this down further so that I generate a summary only for Chinese male members.
I tried redefining the data using the subset() function, and then running describeBy again, e.g.:
chinese <- subset(maindata, chinese=1)
describeBy(chinese, male=1)
But this didn't work, and the results were the same as describeBy(maindata,chinese=1), rather than the Chinese male subset.
I hope that makes sense.
The only other solutions I can think of are to break down my main dataset into smaller ones in MS Excel and re-upload each separately (e.g. Chinese.xls, Indian.xls), or to create a new variable defined by a combination of ethnicity and gender, e.g. Chinesemale=1, Chinesefemale=2, Indianmale=3, Indianfemale=4.
I will more or less need to analyse by these subgroups of subgroups for t-tests and Fisher's exact test, so any good package recommendations that would help with these would be appreciated!
Thanks in advance!!
Sample Data
df1 <- data.frame(subject = c(1, 2, 3, 4, 5),
chinese = c(1, 1, 1, 0, 0),
male = c(1, 0, 1, 0, 1),
value = c(45, 23, 84, 11, 12))
Two changes in syntax from your code:
1. In subset(), use a double equal sign (==). You want to keep rows where chinese is equal to 1. You would use a single equal sign if you were assigning a value of 1 to a parameter called chinese.
2. In describeBy(), the group parameter gives you different summary statistics for each category in that column (as shown below). You can't use it to subset for male=1.
chinese <- subset(df1, chinese == 1)
describeBy(chinese, group = "male")
Descriptive statistics by group
group: 0
vars n mean sd median trimmed mad min max range skew kurtosis se
subject 1 1 2 NA 2 2 0 2 2 0 NA NA NA
chinese 2 1 1 NA 1 1 0 1 1 0 NA NA NA
male 3 1 0 NA 0 0 0 0 0 0 NA NA NA
value 4 1 23 NA 23 23 0 23 23 0 NA NA NA
-------------------------------------------------------------------------------------------------------------------------------------
group: 1
vars n mean sd median trimmed mad min max range skew kurtosis se
subject 1 2 2.0 1.41 2.0 2.0 1.48 1 3 2 0 -2.75 1.0
chinese 2 2 1.0 0.00 1.0 1.0 0.00 1 1 0 NaN NaN 0.0
male 3 2 1.0 0.00 1.0 1.0 0.00 1 1 0 NaN NaN 0.0
value 4 2 64.5 27.58 64.5 64.5 28.91 45 84 39 0 -2.75 19.5
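The difference between = and == inside subset() can be checked directly in base R. The sketch below reuses the sample data from this answer; the names no_filter and filtered are only illustrative. With a single equal sign, chinese = 1 is passed as an unused named argument, so no filtering condition is supplied and every row comes back, which matches the behaviour described in the question.

```r
df1 <- data.frame(subject = c(1, 2, 3, 4, 5),
                  chinese = c(1, 1, 1, 0, 0),
                  male    = c(1, 0, 1, 0, 1),
                  value   = c(45, 23, 84, 11, 12))

# Single "=": treated as a named argument, not a condition,
# so subset() keeps all rows.
no_filter <- subset(df1, chinese = 1)
nrow(no_filter)   # 5

# Double "==": a logical comparison evaluated for each row.
filtered <- subset(df1, chinese == 1)
nrow(filtered)    # 3
```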
If you only want to see the summary stats for males in the sample, you could add & male == 1 to the subset():
chinese <- subset(df1, chinese == 1 & male == 1)
describe(chinese)  # no grouping variable here, so describe() is enough
vars n mean sd median trimmed mad min max range skew kurtosis se
subject 1 2 2.0 1.41 2.0 2.0 1.48 1 3 2 0 -2.75 1.0
chinese 2 2 1.0 0.00 1.0 1.0 0.00 1 1 0 NaN NaN 0.0
male 3 2 1.0 0.00 1.0 1.0 0.00 1 1 0 NaN NaN 0.0
value 4 2 64.5 27.58 64.5 64.5 28.91 45 84 39 0 -2.75 19.5
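For the other parts of the question (summarising every ethnicity-by-gender combination at once, and running t-tests or Fisher's exact test within a subgroup), base R may already be enough. The following is a minimal sketch on made-up data: the club data frame and its smoker and bmi columns are assumptions for illustration, not part of the original dataset.

```r
# Hypothetical data: 40 members with 0/1 indicator columns,
# 10 members in each chinese-by-male cell.
set.seed(1)
club <- data.frame(chinese = rep(c(1, 0), each = 20),
                   male    = rep(c(1, 0), times = 20),
                   smoker  = rep(c(1, 0, 0, 1, 0), times = 8),
                   bmi     = rnorm(40, mean = 24, sd = 3))

# Mean BMI for every ethnicity-by-gender cell in one call (2 x 2 matrix).
cell_means <- tapply(club$bmi,
                     list(chinese = club$chinese, male = club$male),
                     mean)

# The combined grouping variable suggested in the question,
# with levels such as "1_1" for Chinese males.
club$group <- interaction(club$chinese, club$male, sep = "_")
table(club$group)

# psych's documentation describes `group` as accepting a list of grouping
# variables, so something like this should also work if psych is installed:
# psych::describeBy(club$bmi, group = list(club$chinese, club$male))

# Tests within one subgroup: subset first, then test as usual.
chinese_only <- subset(club, chinese == 1)
tt <- t.test(bmi ~ male, data = chinese_only)   # Welch two-sample t-test
ft <- fisher.test(table(chinese_only$male, chinese_only$smoker))
tt$p.value
ft$p.value
```

Subsetting first keeps the test calls identical to what you would write for the full dataset, so no extra package is needed for this step.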