R, Regex, and Matching the Choice of a Qualtrics Response Column

Question

When you export response data from Qualtrics as a CSV, the 2nd row of the data contains strings with the question stem (shortened if necessary), followed by a dash, followed by that response column's corresponding choice. As an example, if my question were "Please select all of the fruit you enjoy:", in my response data the second row of a response column to this question might contain something like "Please select all of the fruit you enjoy:-Blueberries".

Qualtrics shortens the question stem if it is longer than 100 characters. If it is more than 100 characters, the stem is cut off after the 99th character, "..." is appended, and then the dash, and then the choice text.

I am trying to retrieve the text that is after this dash. However, that's difficult, because both the choice text and the question text could contain dashes. I have thought of two different approaches I could take in attempting to select just the choice text:

I have the question text, and can reliably programmatically retrieve it based on the response column name. However, the question text doesn't always match exactly, because Qualtrics removes any HTML styling in the Question text in the response data, but not in the Qualtrics survey file that I am getting the question text from. For questions that don't have any HTML styling, I was thinking about trying to use the question text to somehow match up to and including the dash between the question text and the choice text. I think regex could handle this case fine, but this clearly doesn't work without heavy modification for any questions that have HTML components.
The alternative I think might be more reliable. Strip the question text from the QSF file of any HTML tags, and then count how many "-" characters appear in the question text. Call that n , and then match the 2nd-row-response-entry for up to the n+1 th dash, remove it, and what's remaining is my choice text.

I think the 2nd option is much more likely to work consistently, since the first option leaves me with a case where I have to try and strip html from the question text in exactly the same way Qualtrics does, unless I use fuzzy matching (which I know nothing about). However, the second option is also unclear to me.

an example csv response set

For example, the first question's question text looks like this in the QSF:

"<div style=\"text-align: center;\">Click to write the question text
<span style=\"font-size: 10.8333px;\">thsi<sup>tasdf<em>werasfd</em></sup>
<em>sdfad</em></span><br />\n&nbsp;</div>"

I would appreciate both of the following: advice on which option (or a suggestion for another) you think has the most chance for success, and help with the regex in R for matching the text up to the n+1 th "-" character.

Answer 1

Here's a solution that counts the dashes in the question, locates the nth dash in the text (if any) and drops the preceding characters, and then keeps the substring that follows the next dash in the text.

stem_text <- "Please--select your extracurriculars"
s <- "<em>Please</em>--select your extracurriculars-student-athletics"

# count dashes in question stem
stem_dash_n <- length(gregexpr("-", stem_text)[[1]])
# locate dashes in string
s_dashes <- gregexpr("-", s)[[1]]

sub_start <- ifelse(length(s_dashes), s_dashes[stem_dash_n], 1)
s_sub <- substr(s, sub_start + 1, nchar(s))
sub("[^\\-]*\\-(.*)", "\\1", s_sub, perl = TRUE)
# [1] "student-athletics"

Assumptions: based on your description, length(s_dashes) >= stem_dash_n , so s_dashes[stem_dash_n] exists; the same number of dashes appear in the known stems and their representations in the text; and there is always a dash separating the stem and response choice.

R, Regex, and Matching the Choice of a Qualtrics Response Column

Question

1 answers

solution1
1 ACCPTED 2016-06-10 18:43:02

R, Regex, and Matching the Choice of a Qualtrics Response Column

Question

1 answers

solution1 1 ACCPTED 2016-06-10 18:43:02

solution1
1 ACCPTED 2016-06-10 18:43:02