简体   繁体   中英

Using melt() to convert wide to long data format that requires value lookup

I am having difficulty figuring out how to convert some wide data into long format. I have three columns of string data ( A1_R00_FillerNP , A1_R01_ADV , and A1_R02_1stEmbV ) which I would like to melt into one column ( WordCountRegion ) in such a way that for each Subject and item the correct word will be mapped from one of these three columns to the new, WordCountRegion column.

Using a simple melt approach as in the code below gets me part of the way there:

(Note: the strange characters in the df are inconsequential - please ignore them here)

df <- structure(list(Subject = c(101L, 101L, 101L, 101L, 101L, 101L, 
101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 
101L), condition = structure(c(2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 
3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L), .Label = c("P", "R", 
"S"), class = "factor"), item = c(101L, 102L, 103L, 101L, 102L, 
103L, 101L, 102L, 103L, 101L, 102L, 103L, 101L, 102L, 103L, 101L, 
102L, 103L), A1_R00_FillerNP = structure(c(3L, 2L, 1L, 3L, 2L, 
1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L), .Label = c("SÌÇna d_r allvarliga konsekvenser", 
"SÌÇna d_r fina _ppeltr_d", "SÌÇna d_r gamla skottk_rror"
), class = "factor"), A1_R01_ADV = structure(c(1L, 1L, 2L, 1L, 
1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L), .Label = c("alltid", 
"f_rresten"), class = "factor"), A1_R02_1stEmbV = structure(c(3L, 
2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 
1L), .Label = c("diskuterade", "stod", "tv_ttade"), class = "factor"), 
    RT = c(0L, 149L, 247L, 272L, 171L, 245L, 317L, 0L, 233L, 
    0L, 981L, 750L, 272L, 171L, 334L, 317L, 0L, 233L), Region = structure(c(1L, 
    1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 
    3L, 3L), .Label = c("R00", "R01", "R02"), class = "factor"), 
    RegionType = structure(c(3L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 
    1L, 3L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("1stEmbV", 
    "ADV", "FillerNP"), class = "factor"), DV = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L), .Label = c("FIRST_FIXATION_DURATION", "GAZE_DURATION"
    ), class = "factor")), .Names = c("Subject", "condition", 
"item", "A1_R00_FillerNP", "A1_R01_ADV", "A1_R02_1stEmbV", "RT", 
"Region", "RegionType", "DV"), class = "data.frame", row.names = c(NA, 
-18L))

df1 = melt(df, measure.vars = c("A1_R00_FillerNP","A1_R01_ADV","A1_R02_1stEmbV"), var = "WordCountRegion")

The problem is that this code incorrectly breaks the words across regions. I end up with output like the following, where words do not break as specified by Region and instead extend across values of Region , as can be seen by WordCountRegion and value . It is clear that if I am going to use this, then I need some sort of additional specification so that melt() will be able to break the data correctly. I'm just not sure how to do this (or if it can be done within melt()).

   Subject condition item  RT Region RegionType                      DV WordCountRegion                             value
1      101         R  101   0    R00   FillerNP FIRST_FIXATION_DURATION A1_R00_FillerNP       SÌÇna d_r gamla skottk_rror
2      101         P  102 149    R00   FillerNP FIRST_FIXATION_DURATION A1_R00_FillerNP          SÌÇna d_r fina _ppeltr_d
3      101         S  103 247    R00   FillerNP FIRST_FIXATION_DURATION A1_R00_FillerNP SÌÇna d_r allvarliga konsekvenser
4      101         R  101 272    R01        ADV FIRST_FIXATION_DURATION A1_R00_FillerNP       SÌÇna d_r gamla skottk_rror
5      101         P  102 171    R01        ADV FIRST_FIXATION_DURATION A1_R00_FillerNP          SÌÇna d_r fina _ppeltr_d
6      101         S  103 245    R01        ADV FIRST_FIXATION_DURATION A1_R00_FillerNP SÌÇna d_r allvarliga konsekvenser
7      101         R  101 317    R02    1stEmbV FIRST_FIXATION_DURATION A1_R00_FillerNP       SÌÇna d_r gamla skottk_rror
8      101         P  102   0    R02    1stEmbV FIRST_FIXATION_DURATION A1_R00_FillerNP          SÌÇna d_r fina _ppeltr_d
9      101         S  103 233    R02    1stEmbV FIRST_FIXATION_DURATION A1_R00_FillerNP SÌÇna d_r allvarliga konsekvenser
10     101         R  101   0    R00   FillerNP           GAZE_DURATION A1_R00_FillerNP       SÌÇna d_r gamla skottk_rror
11     101         P  102 981    R00   FillerNP           GAZE_DURATION A1_R00_FillerNP          SÌÇna d_r fina _ppeltr_d
12     101         S  103 750    R00   FillerNP           GAZE_DURATION A1_R00_FillerNP SÌÇna d_r allvarliga konsekvenser
13     101         R  101 272    R01        ADV           GAZE_DURATION A1_R00_FillerNP       SÌÇna d_r gamla skottk_rror
14     101         P  102 171    R01        ADV           GAZE_DURATION A1_R00_FillerNP          SÌÇna d_r fina _ppeltr_d
15     101         S  103 334    R01        ADV           GAZE_DURATION A1_R00_FillerNP SÌÇna d_r allvarliga konsekvenser
16     101         R  101 317    R02    1stEmbV           GAZE_DURATION A1_R00_FillerNP       SÌÇna d_r gamla skottk_rror
17     101         P  102   0    R02    1stEmbV           GAZE_DURATION A1_R00_FillerNP          SÌÇna d_r fina _ppeltr_d
18     101         S  103 233    R02    1stEmbV           GAZE_DURATION A1_R00_FillerNP SÌÇna d_r allvarliga konsekvenser
19     101         R  101   0    R00   FillerNP FIRST_FIXATION_DURATION      A1_R01_ADV                            alltid
20     101         P  102 149    R00   FillerNP FIRST_FIXATION_DURATION      A1_R01_ADV                            alltid
21     101         S  103 247    R00   FillerNP FIRST_FIXATION_DURATION      A1_R01_ADV                         f_rresten

Is there a way that I could modify melt() to get these to line up/match by Region , as in the sample below:

   Subject condition item  RT Region RegionType                      DV WordCountRegion                             value
1      101         R  101   0    R00   FillerNP FIRST_FIXATION_DURATION A1_R00_FillerNP       SÌÇna d_r gamla skottk_rror
2      101         P  102 149    R00   FillerNP FIRST_FIXATION_DURATION A1_R00_FillerNP          SÌÇna d_r fina _ppeltr_d
3      101         S  103 247    R00   FillerNP FIRST_FIXATION_DURATION A1_R00_FillerNP SÌÇna d_r allvarliga konsekvenser
4      101         R  101 272    R01        ADV FIRST_FIXATION_DURATION A1_R01_ADV                                 alltid
5      101         P  102 171    R01        ADV FIRST_FIXATION_DURATION A1_R01_ADV                                 alltid
6      101         S  103 245    R01        ADV FIRST_FIXATION_DURATION A1_R01_ADV                              f_rresten
7      101         R  101 317    R02    1stEmbV FIRST_FIXATION_DURATION A1_R02_1stEmbV                           tv_ttade
8      101         P  102   0    R02    1stEmbV FIRST_FIXATION_DURATION A1_R02_1stEmbV                               stod
9      101         S  103 233    R02    1stEmbV FIRST_FIXATION_DURATION A1_R02_1stEmbV                        diskuterade
10     101         R  101   0    R00   FillerNP           GAZE_DURATION A1_R00_FillerNP       SÌÇna d_r gamla skottk_rror
11     101         P  102 981    R00   FillerNP           GAZE_DURATION A1_R00_FillerNP          SÌÇna d_r fina _ppeltr_d
12     101         S  103 750    R00   FillerNP           GAZE_DURATION A1_R00_FillerNP SÌÇna d_r allvarliga konsekvenser

Or, if I am using the wrong function altogether, could someone please point me towards a better solution? Perhaps I need something that does actual lookups?

You could create a little lookup table, merge it in, then use it to filter your melted dataframe, and I believe this gives you the result you're looking for.

region_df <- data.frame(var = c("A1_R00_FillerNP","A1_R01_ADV","A1_R02_1stEmbV"), 
  Region = c('R00','R01','R02'))

df2 <- merge(df1, region_df)
df3 <- subset(df2, var==WordCountRegion)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM