Aggregating complex dataframe with R (for a beginner)

I am new to R. I am trying to learn the best way to go about aggregating some data in different ways. I have some programming experience, but I'm not super comfortable with R's syntax just yet.

My data now:

I have a large data frame containing measures from a reading time experiment, in a similar format to the made-up snippet below. Each row represents an individual measure with descriptive information about it. Each participant occupies many rows in the data frame, and each row represents a different experimental item:

| Participant | Item | Type | Condition1 | Condition2 | rtMeasure | list    |
|-------------|------|------|------------|------------|-----------|---------|
| 10059       | 215  | Q    | FALSE      | TRUE       | 4215.591  | qiList2 |
| 10059       | 113  | F    | FALSE      | FALSE      | 3472.066  | qiList2 |
| 10059       | 9    | B    | FALSE      | FALSE      | 4201.406  | qiList2 |
| 10059       | 303  | W    | FALSE      | TRUE       | 3619.791  | qiList2 |
| 10060       | 215  | Q    | FALSE      | TRUE       | 4985.057  | qiList2 |
| 10060       | 113  | F    | FALSE      | FALSE      | 3247.489  | qiList2 |
| 10060       | 9    | C    | TRUE       | FALSE      | 2543.65   | qiList2 |
| 10060       | 303  | W    | FALSE      | TRUE       | 3194.199  | qiList2 |
| 10061       | 215  | Q    | FALSE      | TRUE       | 2885.469  | qiList2 |
| 10061       | 113  | F    | FALSE      | FALSE      | 5901.188  | qiList2 |
| 10061       | 9    | D    | FALSE      | TRUE       | 3326.375  | qiList2 |
| 10061       | 303  | W    | FALSE      | TRUE       | 3194.199  | qiList2 |
| 10062       | 215  | Q    | FALSE      | TRUE       | 2885.469  | qiList2 |
| 10062       | 113  | F    | FALSE      | FALSE      | 5901.188  | qiList2 |
| 10062       | 9    | A    | TRUE       | TRUE       | 3326.375  | qiList2 |
| 10062       | 303  | W    | FALSE      | TRUE       | 3194.199  | qiList2 |

The columns are briefly described below:

  • Participant: a number identifying an individual subject.
  • Item: the item that was being presented when this measure was recorded, i.e. the item number.
  • Type: describes the sentence; sometimes redundant.
    • Q, F, W: filler items; these are redundant with the item number.
    • A, B, C, D: different versions of experimentally manipulated items, i.e. a participant who sees 11A would not see 11B, 11C or 11D.
  • Condition1 & Condition2: redundant. A more explicit encoding of the manipulation that is also encoded in the Type column (e.g. Bs are -Condition1, -Condition2; Cs are +Condition1, -Condition2).
  • rtMeasure: the actual measure (in this case, reading time in ms).
  • list: redundant (maps Type to Participant). The list presented to the subject.

What I want to get (exploratory values):

I would like to find, for example, a given participant's mean rtMeasure for type A & B items. I'd also like a given participant's overall mean rtMeasure, and similar exploratory values for sentence types across participants.


Do I want to transform to matrices?

It seems like it would likely be easier to do the above if I were to restructure my data frame to something like Participant by (Item+Type) and the transposed version of this. That is:

| Participant | rtMeasure(Item 1, Type A) | rtMeasure(Item 1, Type B) | ... | rtMeasure(Item 323, Type W) |
|-------------|---------------------------|---------------------------|-----|-----------------------------|
| 12345       | 3343.334                  | NA                        | ... | 2342.115                    |
| 12346       | NA                        | 3343.334                  | ... | 2145.23                     |
| 12346       | NA                        | NA                        | ... | 2511.12                     |

And transposed:

| Participant               | 12345  | 12346  | ... | 12400  |
|---------------------------|--------|--------|-----|--------|
| rtMeasure(Item 1, Type A) | 2341.2 | NA     | ... | 1903.6 |
| rtMeasure(Item 1, Type B) | NA     | 3012.4 | ... | NA     |
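
The Participant by (Item+Type) layout described above can be sketched in base R with tapply. This is only an illustration, not the plyr route: the four rows are copied from the made-up table at the top of the post, and the `label` column and the names `wide` and `by_label` are introduced here for the example.

```r
# Four rows from the made-up table above, in the original long format
data <- data.frame(
  Participant = c(10059, 10059, 10060, 10060),
  Item        = c(215, 9, 215, 9),
  Type        = c("Q", "B", "Q", "C"),
  rtMeasure   = c(4215.591, 4201.406, 4985.057, 2543.65)
)

# One label per (Item, Type) combination, e.g. "9 B" (hypothetical helper column)
data$label <- paste(data$Item, data$Type)

# Participant-by-label matrix; combinations a participant never saw are NA
wide <- tapply(data$rtMeasure, list(data$Participant, data$label), mean)

# The transposed layout is the same matrix flipped
by_label <- t(wide)
```

Row and column means of such a matrix (for example `rowMeans(wide, na.rm = TRUE)`) give per-participant and per-label averages, although the long format is usually enough for that kind of summary.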

It seems like the plyr package can probably do what I need, but I am unclear as to how to attack it.


Would I use a function like this?

I could imagine the solution being a custom function similar to my attempt below, but I don't know how to translate it to R. I'm most comfortable with JavaScript syntax, so I will approximate that while pretending I have an R data frame to work with.

// assume data is the data frame at the start of this post

var participants = valuesOf(data$Participant);
var matrix = [];

for (participantId in participants) {
  var participant = {};
  participant.id = participantId;
  // every row (measure) belonging to this participant
  for (measure in data[data$Participant === participantId]) {
    var measureLabel = measure.Item + ' ' + measure.Type;
    participant[measureLabel] = measure.rtMeasure;
  }
  matrix.push(participant);
}

After the above code executes, I would expect matrix to be an array of participant objects whose properties are measures, labeled by "Item Type".
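
One possible R reading of the pseudocode above, offered as a sketch rather than the definitive translation: split the data frame by participant, then build one named vector per participant. The four rows come from the made-up table at the top of the post; `by_participant` and `measures` are names chosen for this example.

```r
# Four rows from the made-up table above
data <- data.frame(
  Participant = c(10059, 10059, 10060, 10060),
  Item        = c(215, 9, 215, 9),
  Type        = c("Q", "B", "Q", "C"),
  rtMeasure   = c(4215.591, 4201.406, 4985.057, 2543.65)
)

# split() replaces the outer loop: one sub-data-frame per participant
by_participant <- split(data, data$Participant)

# lapply() replaces the inner loop: for each participant, a numeric vector
# whose names are "Item Type" and whose values are the reading times
measures <- lapply(by_participant, function(rows) {
  setNames(rows$rtMeasure, paste(rows$Item, rows$Type))
})

# Look up one measure for one participant
measures[["10059"]][["215 Q"]]
```

The result is a list keyed by participant id rather than an array of objects, which is the closer idiom in R.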

As per Frank's suggestion, I attempted to create an MCVE. As he hinted might happen, I found the answer I was looking for by forcing myself to actually read through the somewhat intimidating tutorial for the plyr package: The Split-Apply-Combine Strategy for Data Analysis.

I also found Summarizing data in http://www.cookbook-r.com/ to be helpful.

Basically I discovered how to use ddply, the plyr function for aggregating data frames into different data frames.

In my original question I asked how to look at

  • a given participant's mean rtMeasure
  • a given participant's mean rtMeasure for type A & B items
  • similar exploratory values for sentence types across participants

I'm going to outline how I did each thing in case someone else will find it useful.

First, load plyr and some made-up data:

> library(plyr)
> df <- read.csv('df.csv')
> df
   participants items types condition1 condition2 rtMeasures
1          1001   101     F      FALSE       TRUE   3852.823
2          1001   213     Q       TRUE       TRUE   2499.445
3          1001     1     C      FALSE      FALSE   2811.198
4          1001   312     W       TRUE       TRUE   2200.470
5          1001   113     F       TRUE      FALSE   2419.663
6          1002   101     F      FALSE       TRUE   1833.647
7          1002   213     Q       TRUE       TRUE   2381.160
8          1002     1     B      FALSE      FALSE   2415.385
9          1002   312     W       TRUE       TRUE   2788.386
10         1002   113     F       TRUE      FALSE   2665.298

The first one is easy.

Use ddply to get each participant's mean rtMeasure:

> ddply(df, .(participants), summarize, mean=mean(rtMeasures), N=length(participants))
  participants     mean N
1         1001 2756.720 5
2         1002 2416.775 5
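
For comparison, the same per-participant summary can be sketched in base R with aggregate, without plyr. The data frame below rebuilds the made-up data printed above so the snippet runs without df.csv; `means`, `counts` and `summary_df` are names chosen for this example.

```r
# Rebuild the made-up data shown above (stands in for read.csv('df.csv'))
df <- data.frame(
  participants = rep(c(1001, 1002), each = 5),
  items        = rep(c(101, 213, 1, 312, 113), times = 2),
  types        = c("F", "Q", "C", "W", "F", "F", "Q", "B", "W", "F"),
  condition1   = rep(c(FALSE, TRUE, FALSE, TRUE, TRUE), times = 2),
  condition2   = rep(c(TRUE, TRUE, FALSE, TRUE, FALSE), times = 2),
  rtMeasures   = c(3852.823, 2499.445, 2811.198, 2200.470, 2419.663,
                   1833.647, 2381.160, 2415.385, 2788.386, 2665.298)
)

# Mean and row count per participant, using the formula interface
means  <- aggregate(rtMeasures ~ participants, data = df, FUN = mean)
counts <- aggregate(rtMeasures ~ participants, data = df, FUN = length)

# Combine into one table shaped like the ddply output above
summary_df <- data.frame(
  participants = means$participants,
  mean         = means$rtMeasures,
  N            = counts$rtMeasures
)
```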

The second is a little trickier. There's probably a better way, but for a quick and dirty solution, this works.

Use ddply to get each participant's mean rtMeasure for Q and W items versus all other types:

> ddply(df, .(participants, "is type Q or W"=(types %in% c('Q', 'W'))), summarize, mean=mean(rtMeasures), N=length(participants))
  participants is type Q or W     mean N
1         1001          FALSE 3027.895 3
2         1001           TRUE 2349.958 2
3         1002          FALSE 2304.777 3
4         1002           TRUE 2584.773 2

To be clear, I am splitting my data on whether the type of each measure is Q or W. In my example, rows where the "is type Q or W" column is FALSE show that participant's mean over the A, B, C, D and F measures; rows where it is TRUE show the mean over the Q and W measures. In my actual data these types are already binary-coded, so it shouldn't be as messy.
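
The original question asked about type A & B items specifically. A minimal base R sketch of that case, assuming it is enough to filter the rows first and then average per participant; the data frame rebuilds the made-up data above, and `ab` and `ab_means` are names chosen for this example.

```r
# Rebuild the made-up data shown above (stands in for read.csv('df.csv'))
df <- data.frame(
  participants = rep(c(1001, 1002), each = 5),
  items        = rep(c(101, 213, 1, 312, 113), times = 2),
  types        = c("F", "Q", "C", "W", "F", "F", "Q", "B", "W", "F"),
  condition1   = rep(c(FALSE, TRUE, FALSE, TRUE, TRUE), times = 2),
  condition2   = rep(c(TRUE, TRUE, FALSE, TRUE, FALSE), times = 2),
  rtMeasures   = c(3852.823, 2499.445, 2811.198, 2200.470, 2419.663,
                   1833.647, 2381.160, 2415.385, 2788.386, 2665.298)
)

# Keep only the type A and type B rows
ab <- subset(df, types %in% c("A", "B"))

# Mean reading time per participant over those rows only
ab_means <- aggregate(rtMeasures ~ participants, data = ab, FUN = mean)
```

In this small sample only participant 1002 saw a type B item and nobody saw a type A item, so `ab_means` has a single row; participants with no A or B rows simply drop out of the result.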


It's just as easy to group by items, condition1, or any other descriptor in your data frame:

> ddply(df, .(items, types), summarize, mean=mean(rtMeasures), N=length(participants))
  items types     mean N
1     1     B 2415.385 1
2     1     C 2811.198 1
3   101     F 2843.235 2
4   113     F 2542.481 2
5   213     Q 2440.302 2
6   312     W 2494.428 2

Getting fancy, you can also group by a computed value, here whether both conditions are FALSE:

> ddply(df, .(Context=(condition1==FALSE & condition2==FALSE)), summarize, mean=mean(rtMeasures), N=length(participants))
  Context     mean N
1   FALSE 2580.112 8
2    TRUE 2613.291 2
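
The Context grouping above collapses the two conditions into a single TRUE/FALSE flag. If the full two-by-two breakdown is of interest, one way to sketch it in base R is a tapply cross-tabulation; the data frame rebuilds the made-up data above, and `cond_means` is a name chosen for this example.

```r
# Rebuild the made-up data shown above (stands in for read.csv('df.csv'))
df <- data.frame(
  participants = rep(c(1001, 1002), each = 5),
  items        = rep(c(101, 213, 1, 312, 113), times = 2),
  types        = c("F", "Q", "C", "W", "F", "F", "Q", "B", "W", "F"),
  condition1   = rep(c(FALSE, TRUE, FALSE, TRUE, TRUE), times = 2),
  condition2   = rep(c(TRUE, TRUE, FALSE, TRUE, FALSE), times = 2),
  rtMeasures   = c(3852.823, 2499.445, 2811.198, 2200.470, 2419.663,
                   1833.647, 2381.160, 2415.385, 2788.386, 2665.298)
)

# Rows index condition1, columns index condition2; each cell is a mean
cond_means <- with(df, tapply(rtMeasures, list(condition1, condition2), mean))

# The cell where both conditions are FALSE matches the Context == TRUE row above
cond_means["FALSE", "FALSE"]
```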
