
How can I structure and recode messy categorical data in R?

I'm struggling with how best to structure messy categorical data that comes from a dataset I need to clean.

The Coding Scheme

I'm analyzing data from a university science course exam. We're looking at patterns in student responses, and we developed a coding scheme to represent the kinds of things students are doing in their answers. A subset of the coding scheme is shown below.

Note that within each major code (1, 2, 3) are nested non-unique sub-codes (a, b, ...).

What the Raw Data Looks Like

I've created an anonymized, raw subset of my actual data which you can view here. Part of my problem is that those who coded the data noticed that some students displayed multiple patterns. The coders' solution was to create enough columns (reason1, reason2, ...) to hold students with multiple patterns. That becomes important because the order (reason1, reason2) is arbitrary: two students (like student 41 and student 42 in my dataset) who correctly applied "dependency" should both register in an analysis, regardless of whether 3a appears in the reason column or the reason2 column.

How Can I Best Structure Student Data?

Part of my problem is that in the raw data, not all students display the same patterns, or the same number of them, in the same order. Some students may do just one thing, others may do several. So, an abstracted representation of example students might look like this:

Note in the example above that student002 and student003 are both coded as "1b", although I've deliberately shown the order as different to reflect the reality of my data.

My (Practical) Questions

  1. Should I concatenate reason1, reason2, ... into one column?
  2. How can I (re)code the reasons in R to reflect the multiplicity for some students?

Thanks

I realize this question is as much about good data conceptualization as it is about specific features of R, but I thought it would be appropriate to ask it here. If you feel it's inappropriate for me to ask the question, please let me know in the comments, and stackoverflow will automatically flood my inbox with sadface emoticons. If I haven't been specific enough, please let me know and I'll do my best to be clearer.

Make it "long":

library(reshape)  ## provides melt()
dnow <- read.csv("~/Downloads/catsample20100504.csv")
## one row per (student, reason) pair instead of one column per reason
dnow <- melt(dnow, id.vars = c("Student", "instructor"))
dnow$variable <- NULL  ## since ordering does not matter
subset(dnow, Student %in% c(41, 42))  ## see the results

What to do next will depend on the kind of analysis you would like to do. But the long format is useful for irregular data such as yours.
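To make the long format concrete without the real file, here is a minimal sketch on a made-up three-student data frame. It uses base R rather than melt() so it runs with no extra packages. The toy values, the names wide, long, major and sub, and the assumption that every code is exactly one digit followed by one letter are mine, not taken from the actual dataset.

```r
# Toy data (made up for illustration): one row per student,
# with the reasons recorded in arbitrary column order
wide <- data.frame(
  Student = c(1, 2, 3),
  reason1 = c("1a", "1b", "2a"),
  reason2 = c(NA, "3a", "1b"),
  stringsAsFactors = FALSE
)

# Stack the reason columns into a single "value" column,
# giving one row per (student, code) pair
reason_cols <- c("reason1", "reason2")
long <- data.frame(
  Student = rep(wide$Student, times = length(reason_cols)),
  value   = unlist(wide[reason_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long <- long[!is.na(long$value), ]  # drop the unused slots

# Split each code into its major code and nested sub-code
# (assumes codes look like "3a": one digit, then one letter)
long$major <- substr(long$value, 1, 1)
long$sub   <- substr(long$value, 2, 2)

# Students coded "1b", whichever column the code was entered in
students_1b <- sort(unique(long$Student[long$value == "1b"]))

# Number of rows per code across all students
code_counts <- table(long$value)
```

Once the data is in this shape, a student with several patterns is simply several rows, so nothing needs to be concatenated and the original column order no longer matters.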

You should use ddply from plyr and split on all of the columns if you want to take the different reasons into account; if you want to ignore them, leave those columns out of the split. You'll need to clean up some of the question marks and extra stuff first, though.

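library(plyr)
## "split_column1" and "split_column3" are placeholders for your grouping columns
x <- ddply(data, c("split_column1", "split_column3"), summarise,
           n = length(split_column1))  ## replace with the stats you want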
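As a rough illustration of that split-and-summarise step, the sketch below runs ddply on a small made-up data frame that is already in long format. The data, the column name value, and the choice of statistic (number of distinct students) are assumptions for the example; only the Student and instructor column names come from the thread.

```r
library(plyr)

# Toy long-format data (made up): one row per (student, code) pair
long <- data.frame(
  Student    = c(1, 2, 2, 3, 3),
  instructor = c("A", "A", "A", "B", "B"),
  value      = c("1a", "1b", "3a", "2a", "1b"),
  stringsAsFactors = FALSE
)

# Split on instructor and code, then count distinct students in each piece
counts <- ddply(long, c("instructor", "value"), summarise,
                n_students = length(unique(Student)))

# Leave instructor out of the split to ignore it
by_code <- ddply(long, "value", summarise,
                 n_students = length(unique(Student)))
```

Which columns go in the split is the whole design decision here: every column you list becomes a dimension of the summary, and every column you omit is pooled over.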

What's the (bigger picture) question you're attempting to answer? Why is this information interesting to you?

Are you just trying to find patterns such as 'if the student does this, then they also likely do this'?

If that's the case, something I'd consider: split the data set into smaller random samples for your analysis, to reduce the risk of false positives.
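One possible reading of that suggestion, sketched under my own assumptions: randomly divide the students into two halves, look for patterns in one half, and check whether they hold up in the other. The IDs, the seed, and the names explore_ids and confirm_ids are hypothetical. The split is done by student rather than by row so that all of a student's codes stay in the same half.

```r
set.seed(42)                 # arbitrary seed so the split is reproducible
student_ids <- 1:100         # hypothetical student IDs
n_explore   <- length(student_ids) %/% 2

# Random half used to look for patterns
explore_ids <- sample(student_ids, n_explore)

# Remaining half used to check whether those patterns hold up
confirm_ids <- setdiff(student_ids, explore_ids)
```

With long-format data, each half can then be pulled out with something like subset(dnow, Student %in% explore_ids).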

Interesting problem, though!
