I am new to R. I am trying to learn the best way to go about aggregating some data in different ways. I have some programming experience, but I'm not super comfortable with R's syntax just yet.
I have a large data frame containing measures from a reading time experiment, in a similar format to the made-up snippet below. Each row represents an individual measure with descriptive information about it. Each participant occupies many rows in the data frame, and each row represents a different experimental item:
| Participant | Item | Type | Condition1 | Condition2 | rtMeasure | list |
|-------------|------|------|------------|------------|-----------|---------|
| 10059 | 215 | Q | FALSE | TRUE | 4215.591 | qiList2 |
| 10059 | 113 | F | FALSE | FALSE | 3472.066 | qiList2 |
| 10059 | 9 | B | FALSE | FALSE | 4201.406 | qiList2 |
| 10059 | 303 | W | FALSE | TRUE | 3619.791 | qiList2 |
| 10060 | 215 | Q | FALSE | TRUE | 4985.057 | qiList2 |
| 10060 | 113 | F | FALSE | FALSE | 3247.489 | qiList2 |
| 10060 | 9 | C | TRUE | FALSE | 2543.65 | qiList2 |
| 10060 | 303 | W | FALSE | TRUE | 3194.199 | qiList2 |
| 10061 | 215 | Q | FALSE | TRUE | 2885.469 | qiList2 |
| 10061 | 113 | F | FALSE | FALSE | 5901.188 | qiList2 |
| 10061 | 9 | D | FALSE | TRUE | 3326.375 | qiList2 |
| 10061 | 303 | W | FALSE | TRUE | 3194.199 | qiList2 |
| 10062 | 215 | Q | FALSE | TRUE | 2885.469 | qiList2 |
| 10062 | 113 | F | FALSE | FALSE | 5901.188 | qiList2 |
| 10062 | 9 | A | TRUE | TRUE | 3326.375 | qiList2 |
| 10062 | 303 | W | FALSE | TRUE | 3194.199 | qiList2 |
The columns are briefly described below:
Participant
: a number identifying an individual subject.

Item
: the item that was being presented when this measure was recorded, i.e. the item number.

Type
: descriptive of the sentence, sometimes redundant.

Q, F, W
: filler items; these are redundant with item number.

A, B, C, D
: different versions of experimentally manipulated items, i.e. a participant might see 11A and would therefore not see 11B, 11C, or 11D.

Condition1 & Condition2
: Redundant. A more explicit encoding of the manipulation also encoded in the Type column (e.g. Bs are -Condition1, -Condition2; Cs are +Condition1, -Condition2).

rtMeasure
: the actual measure (in this case, reading time in ms).

List
: Redundant (maps Type to Participant). The list presented to the subject.

I would like to discover, for example, a given participant's mean `rtMeasure` for type `A` & `B` items. I'd also like a given participant's overall mean `rtMeasure`. I'd also like to see similar exploratory values for sentence types across participants.
It seems like it would likely be easier to do the above if I were to restructure my data frame to something like Participant by (Item+Type) and the transposed version of this. That is:
| Participant | rtMeasure(Item 1, Type A) | rtMeasure(Item 1, Type B) | ... | rtMeasure(Item 323, Type W) |
|-------------|---------------------------|---------------------------|-----|-----------------------------|
| 12345 | 3343.334 | NA | ... | 2342.115 |
| 12346 | NA | 3343.334 | ... | 2145.23 |
| 12346 | NA | NA | ... | 2511.12 |
And transposed:
| Participant | 12345 | 12346 | ... | 12400 |
|---------------------------|--------|--------|-----|--------|
| rtMeasure(Item 1, Type A) | 2341.2 | NA | ... | 1903.6 |
| rtMeasure(Item 1, Type B) | NA | 3012.4 | ... | NA |
It seems like the plyr package can probably do what I need, but I am unclear as to how to attack it.
Would I use a function like this?
I could see the solution being a custom function of some similarity to my attempt below, but I don't know how to translate this to R... I'm most comfortable with JavaScript syntax, so I will approximate that, but imagining I have an R dataframe to work with.
// Pseudocode in JavaScript-like syntax; assume data is the data frame at the start of this post
var participants = valuesOf(data$Participant); // valuesOf: imagined helper giving the participant IDs
var matrix = [];
for (var participantId of participants) {
  var participant = {};
  participant.id = participantId;
  // each measure is one row belonging to this participant
  for (var measure of data[data$Participant === participantId]) {
    var measureLabel = measure.Item + ' ' + measure.Type;
    participant[measureLabel] = measure.rtMeasure;
  }
  matrix.push(participant);
}
After the above code executes, I would expect `matrix` to be an array of `participant` objects, where the properties are measures, labeled by "Item Type".
As per Frank's suggestion, I attempted to create an MCVE. As he hinted might happen, I found the answer I was looking for by forcing myself to actually read through the somewhat intimidating tutorial for the plyr package: The Split-Apply-Combine Strategy for Data Analysis.
I also found Summarizing data in http://www.cookbook-r.com/ to be helpful.
Basically I discovered how to use ddply, the plyr function for aggregating data frames into different data frames.
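In my original question I asked how to look at a given participant's overall mean rtMeasure, a given participant's mean rtMeasure for particular item types, and similar summaries for sentence types across participants.
I'm going to outline how I did each thing in case someone else finds it useful.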
First, load plyr and some made-up data:
> library(plyr)
> df <- read.csv('df.csv')
> df
participants items types condition1 condition2 rtMeasures
1 1001 101 F FALSE TRUE 3852.823
2 1001 213 Q TRUE TRUE 2499.445
3 1001 1 C FALSE FALSE 2811.198
4 1001 312 W TRUE TRUE 2200.470
5 1001 113 F TRUE FALSE 2419.663
6 1002 101 F FALSE TRUE 1833.647
7 1002 213 Q TRUE TRUE 2381.160
8 1002 1 B FALSE FALSE 2415.385
9 1002 312 W TRUE TRUE 2788.386
10 1002 113 F TRUE FALSE 2665.298
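Since `df.csv` itself is not attached, here is a minimal sketch that rebuilds the same data frame inline. The values are copied from the printout above; the only assumption is that the participant and item IDs are stored as integers, which is what `read.csv` would normally infer.

```r
# Rebuild the example data inline; values copied from the printout above.
df <- data.frame(
  participants = rep(c(1001L, 1002L), each = 5),
  items        = rep(c(101L, 213L, 1L, 312L, 113L), times = 2),
  types        = c("F", "Q", "C", "W", "F", "F", "Q", "B", "W", "F"),
  condition1   = rep(c(FALSE, TRUE, FALSE, TRUE, TRUE), times = 2),
  condition2   = rep(c(TRUE, TRUE, FALSE, TRUE, FALSE), times = 2),
  rtMeasures   = c(3852.823, 2499.445, 2811.198, 2200.470, 2419.663,
                   1833.647, 2381.160, 2415.385, 2788.386, 2665.298)
)
str(df)  # quick look at column types and the first few values
```

With this in place, the commands in the rest of the post can be pasted in without the CSV file.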
The first one is easy.
Use ddply to get each participant's mean rtMeasure:
> ddply(df, .(participants), summarize, mean=mean(rtMeasures), N=length(participants));
participants mean N
1 1001 2756.720 5
2 1002 2416.775 5
The second is a little trickier. There's probably a better way, but for a quick and dirty solution, this works.
Use ddply to get each participant's mean rtMeasure for a chosen group of types (here, Q and W versus everything else):
> ddply(df, .(participants, "is type Q or W"=(types %in% c('Q', 'W'))), summarize, mean=mean(rtMeasures), N=length(participants));
participants is type Q or W mean N
1 1001 FALSE 3027.895 3
2 1001 TRUE 2349.958 2
3 1002 FALSE 2304.777 3
4 1002 TRUE 2584.773 2
To be clear, I am carving up my data based on whether or not the "type" of measure is Q or W. So for my example, the rows where the "is type Q or W" column lists FALSE show that participant's mean over ABCDF-type measures; where that column is TRUE, the row shows the mean over QW-type measures. In my actual data, these "types" are already binary-coded, so it shouldn't be as messy.
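The original question asked about means for specific types (A and B) rather than a TRUE/FALSE split. One possible sketch, using base R's `aggregate` instead of plyr, is to filter to the types of interest first and then average per participant and type. The made-up sample has no A items, so B and C stand in here; `wanted` and `by_type` are names introduced only for this example, and the data frame is a trimmed copy of the one printed earlier.

```r
# Trimmed copy of the example data (only the columns needed here).
df <- data.frame(
  participants = rep(c(1001L, 1002L), each = 5),
  types        = c("F", "Q", "C", "W", "F", "F", "Q", "B", "W", "F"),
  rtMeasures   = c(3852.823, 2499.445, 2811.198, 2200.470, 2419.663,
                   1833.647, 2381.160, 2415.385, 2788.386, 2665.298)
)

# Types to keep; B and C are stand-ins because the sample has no A rows.
wanted <- c("B", "C")

# Keep only the wanted rows, then take the mean per participant and type.
by_type <- aggregate(rtMeasures ~ participants + types,
                     data = df[df$types %in% wanted, ],
                     FUN = mean)
print(by_type)
```

Filtering before grouping keeps the unwanted types out of the result entirely, rather than lumping them into a FALSE row.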
And it's just as easy to group by `items` or `condition1` or any other descriptor in your data frame.
> ddply(df, .(items, types), summarize, mean=mean(rtMeasures), N=length(participants));
items types mean N
1 1 B 2415.385 1
2 1 C 2811.198 1
3 101 F 2843.235 2
4 113 F 2542.481 2
5 213 Q 2440.302 2
6 312 W 2494.428 2
Getting fancy...
> ddply(df, .(Context=(condition1==FALSE & condition2==FALSE)), summarize, mean=mean(rtMeasures), N=length(participants));
Context mean N
1 FALSE 2580.112 8
2 TRUE 2613.291 2
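For completeness, the wide Participant by (Item + Type) layout sketched in the question turned out not to be necessary for these summaries, but it can be approximated. The following is a rough sketch using base R's `tapply` rather than plyr; `label` and `wide` are names introduced for this example, and the data frame is a trimmed copy of the made-up data above.

```r
# Trimmed copy of the example data (only the columns needed here).
df <- data.frame(
  participants = rep(c(1001L, 1002L), each = 5),
  items        = rep(c(101L, 213L, 1L, 312L, 113L), times = 2),
  types        = c("F", "Q", "C", "W", "F", "F", "Q", "B", "W", "F"),
  rtMeasures   = c(3852.823, 2499.445, 2811.198, 2200.470, 2419.663,
                   1833.647, 2381.160, 2415.385, 2788.386, 2665.298)
)

# Build an "Item Type" label for each row, as in the question's pseudocode.
df$label <- paste(df$items, df$types)

# Cross-tabulate: one row per participant, one column per label.
# Cells with no matching measure come out as NA.
wide <- tapply(df$rtMeasures, list(df$participants, df$label), mean)
print(wide)

# The transposed layout from the question is just t(wide).
print(t(wide))
```

`mean` is used as the cell function so that repeated measures for the same participant and label would be averaged; with one row per cell it simply returns that value.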