
Remove specific last characters from a column name in R

I need help either removing the last few characters from a column name when they meet certain criteria, or tweaking my current code so the names come out right from the start.

I am working with student test data on Common Core assessments and the column names don't follow a consistent format. The data frame is structured as such:

>names(df)
[1] Student.ID
[2] State.ID
[3] "X2.MD.A.1.Select.and.Use.Appropriate.Tools.to.Measure.Length.Percent.Correct"
[4] "X2.MD.A.3.Estimate.Length.Percent.Correct"                                   
[5] "X2.MD.A.4.Measurement.Difference.Percent.Correct"                             
[6] "X2.MD.B.5.Addition.and.Subtraction.Word.Problems..Lengths.Percent.Correct"   
[7] "X2.NBT.A.1.Understand.Place.Value.Percent.Correct"                            
[8] "X2.NBT.A.1.a.Understand.Place.Value..Bundles.of.Tens.Percent.Correct"        
[9] "X2.NBT.A.1.b.Understand.Place.Value..Bundles.of.Hundreds.Percent.Correct"     
[10] "X2.NBT.A.3.Read.and.Write.Numbers.to.1.000.Percent.Correct"   

This is my desired result:

>names(df)
[1] Student.ID
[2] State.ID
[3] A1_2.MD.A.1
[4] A1_2.MD.A.3
[5] A1_2.MD.A.4
[6] A1_2.MD.B.5
[7] A1_2.NBT.A.1
[8] A1_2.NBT.A.1.a
[9] A1_2.NBT.A.1.b
[10] A1_2.NBT.A.3

This is the code I have so far, but it's only getting me part of the way:

library(reshape2)
library(reshape)
library(stringr)
library(dplyr)
library(qdap)

## strip the leading "X2." from the score columns
for (column in c(3:ncol(df))) {
  colnames(df)[column] <- substr(colnames(df)[column], 4, nchar(colnames(df)[column]))
}

## reduce column names to only the letter and number (strip the description)
for (column in c(3:ncol(df))) {
  if (nchar(beg2char(colnames(df)[column], ".")) < 3) {
    colnames(df)[column] <- substr(colnames(df)[column], 1, 8)
  } else if (nchar(beg2char(colnames(df)[column], ".")) > 2) {
    colnames(df)[column] <- substr(colnames(df)[column], 1, 9)
  }
}
## add screening number indicator to start of percent scores
for (column in c(3:ncol(df))) {
  colnames(df)[column] <- paste("A1_2", colnames(df)[column], sep=".")
}

Now I'm getting:

>names(df)
[1] Student.ID
[2] State.ID
[3] A1_2.MD.A.1.S
[4] A1_2.MD.A.3.E
[5] A1_2.MD.A.4.M
[6] A1_2.MD.B.5.A
[7] A1_2.NBT.A.1.U
[8] A1_2.NBT.A.1.a
[9] A1_2.NBT.A.1.b
[10] A1_2.NBT.A.3.R

Thanks in advance for your help!

You can use

names <- c(your_col_names_here)
names <- gsub("^X2\\.((?:[^.]+\\.){2}[^.]+(?:\\.[a-z])?).*",
              "A1_2.\\1", names)
names(df) <- names

See a demo on regex101.com.
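To make the pattern easier to follow, here is a small sketch that annotates each piece and runs it on three of the names from the question. The variable names `old`, `pattern`, and `new` are only for illustration, and `perl = TRUE` is passed so the non-capturing groups are handled by the PCRE engine.

```r
# Three column names taken from the question
old <- c("Student.ID",
         "X2.MD.A.1.Select.and.Use.Appropriate.Tools.to.Measure.Length.Percent.Correct",
         "X2.NBT.A.1.a.Understand.Place.Value..Bundles.of.Tens.Percent.Correct")

# ^X2\\.            the literal "X2." prefix at the start of the name
# (?:[^.]+\\.){2}   two dot-terminated tokens, e.g. "MD." and "A."
# [^.]+             the next token, e.g. "1"
# (?:\\.[a-z])?     an optional single lowercase sub-part, e.g. ".a"
# .*                the description, which is dropped
pattern <- "^X2\\.((?:[^.]+\\.){2}[^.]+(?:\\.[a-z])?).*"

# \\1 re-inserts the captured standard code after the new "A1_2." prefix
new <- gsub(pattern, "A1_2.\\1", old, perl = TRUE)
print(new)
# [1] "Student.ID"     "A1_2.MD.A.1"    "A1_2.NBT.A.1.a"
```

Names that do not start with "X2." (such as Student.ID) do not match and are returned unchanged, so the substitution can safely be applied to all column names at once.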


As a whole R snippet:

# create a dummy df to test with
df <- as.data.frame(matrix(0, ncol = 10, nrow = 1))

names <- c("Student.ID", "State.ID",
           "X2.MD.A.1.Select.and.Use.Appropriate.Tools.to.Measure.Length.Percent.Correct",
           "X2.MD.A.3.Estimate.Length.Percent.Correct",
           "X2.MD.A.4.Measurement.Difference.Percent.Correct",
           "X2.MD.B.5.Addition.and.Subtraction.Word.Problems..Lengths.Percent.Correct",
           "X2.NBT.A.1.Understand.Place.Value.Percent.Correct",
           "X2.NBT.A.1.a.Understand.Place.Value..Bundles.of.Tens.Percent.Correct",
           "X2.NBT.A.1.b.Understand.Place.Value..Bundles.of.Hundreds.Percent.Correct",
           "X2.NBT.A.3.Read.and.Write.Numbers.to.1.000.Percent.Correct")

# rename: keep only the standard code and prefix it with "A1_2."
names(df) <- gsub(pattern = "^X2\\.((?:[^.]+\\.){2}[^.]+(?:\\.[a-z])?).*",
                  "A1_2.\\1", names)
df
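One possible caveat with the optional `(?:\\.[a-z])?` part: it would also pick up the first letter of a description that happens to start in lowercase. None of the names in the question do, so this is only a precaution. A stricter variant, sketched below, keeps the lowercase letter only when it is a complete token, using a lookahead (which requires `perl = TRUE`). The second test name is hypothetical and not from the original data.

```r
# Pattern from the answer: ".a" is kept whenever a lowercase letter follows the dot
loose  <- "^X2\\.((?:[^.]+\\.){2}[^.]+(?:\\.[a-z])?).*"

# Stricter variant: the lowercase letter must be followed by a dot or end of string
strict <- "^X2\\.((?:[^.]+\\.){2}[^.]+(?:\\.[a-z](?=\\.|$))?).*"

x <- c("X2.NBT.A.1.a.Understand.Place.Value..Bundles.of.Tens.Percent.Correct",
       "X2.MD.A.1.and.Something.Percent.Correct")  # hypothetical name

loose_out  <- gsub(loose,  "A1_2.\\1", x, perl = TRUE)
strict_out <- gsub(strict, "A1_2.\\1", x, perl = TRUE)

print(loose_out)
# [1] "A1_2.NBT.A.1.a" "A1_2.MD.A.1.a"
print(strict_out)
# [1] "A1_2.NBT.A.1.a" "A1_2.MD.A.1"
```

Both patterns give the same result on the real column names; they differ only on the hypothetical one.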
