R: changing column names for improved documentation

I have two CSV files: one contains measurements at several points, and the other contains a description of each point. The real data has about 100 different points and tens of thousands of measurements, but for simplicity let's assume there are only two points and two measurements.

data.csv:

point1,point2,date
25,80,11.06.2013
26,70,10.06.2013

description.csv:

point,name,description
point1,tempA,Temperature in room A
point2,humidA,Humidity in room A

Now I read both CSV files into data frames. Then I change the column names in the data frame to make them more readable.

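options(stringsAsFactors = FALSE)
DataSource <- read.csv("data.csv")
DataDescription <- read.csv("description.csv")
# for every column name, scan the description for the matching point
for (name.source in names(DataSource)) 
{
  count <- 1  # row index into DataDescription
  for (name.target in DataDescription$point) 
  {
    if (name.source == name.target) 
    {
      # replace the column name with the short name from the description
      names(DataSource)[names(DataSource) == name.source] <- DataDescription[count, 'name']
    }
    count <- count + 1
  }
}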

So, my questions now are: Is there a way to do this without the loops? And would you change the names for readability as I did or not? If not, why?

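The trick with replacements is sometimes to match the indexing on both sides of the assignment. Note that using the same index on both sides assumes that the i-th row of DataDescription describes the i-th column of DataSource, as it does in the sample data: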

names(DataSource)[match(DataDescription$point, names(DataSource))] <- 
   DataDescription$name[match(DataDescription$point, names(DataSource))]

#> DataSource
  tempA humidA       date
1    25     80 11.06.2013
2    26     70 10.06.2013
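
For readers unfamiliar with match(), here is a minimal sketch of what it returns. It uses hard-coded vectors based on the sample data rather than the CSV files, and the names data_names and points are only illustrative; the third point is made up to show the no-match case.

```r
# Column names of the sample data and a list of points to look up
data_names <- c("point1", "point2", "date")
points <- c("point2", "point1", "point3")  # reordered on purpose, "point3" is unknown

# match(x, table) gives, for each element of x, its position in table,
# or NA when the element does not occur in table
pos <- match(points, data_names)
pos  # positions 2 and 1, then NA for the unknown point
```

Because the result is a vector of positions in the second argument, it can be used directly as an index into names(DataSource) on the left-hand side of an assignment.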

Earlier effort, using the descriptions as names:

 names(DataSource)[match(DataDescription$point, names(DataSource))] <-
                gsub(" ", "_", DataDescription$description)[ 
                   match(DataDescription$point, names(DataSource))]

#> DataSource
  Temperature_in_room_A Humidity_in_room_A       date
1                    25                 80 11.06.2013
2                    26                 70 10.06.2013
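
The gsub() call above swaps spaces for underscores so that the resulting column names remain syntactic R names. One way to check this is sketched below with base R's make.names(); the helper is_syntactic is a hypothetical name introduced only for illustration.

```r
# A name is treated as syntactic here if make.names() leaves it unchanged
is_syntactic <- function(x) make.names(x) == x

desc <- c("Temperature in room A", "Humidity in room A")

is_syntactic(desc)                  # descriptions with spaces
is_syntactic(gsub(" ", "_", desc))  # descriptions with underscores
```

A column with a non-syntactic name can still be used, but it has to be written in backticks or addressed by a quoted string, which gets tedious in formulas and with `$`.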

Notice that I did not put non-syntactic names on that dataframe. To do so would have been a disservice. Anando Mahto's comment is well considered. I would not want to do this unless it were at the very end of data processing or a side excursion on the way to a plotting effort. In that case I might not substitute the underscores. In the case where you wanted plotting labels, there might be a further need to insert "\\n" to fold the text within space constraints.
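
As a rough sketch of the label folding mentioned above, base R's strwrap() can split a long description into short lines, which are then joined with a newline. The helper name fold_label and the width of 12 characters are assumptions chosen for illustration, not part of the original answer.

```r
# Descriptions as they appear in description.csv
labels <- c("Temperature in room A", "Humidity in room A")

# fold_label (hypothetical helper): wrap each label to roughly `width`
# characters and join the pieces with a newline for use as a plot label
fold_label <- function(x, width = 12) {
  vapply(x,
         function(s) paste(strwrap(s, width = width), collapse = "\n"),
         character(1), USE.NAMES = FALSE)
}

folded <- fold_label(labels)
cat(folded, sep = "\n---\n")
```

The folded strings are meant for axis labels or legends only; the data frame itself can keep its short syntactic names.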

OK, I sorted the columns in the first file and the rows in the second one to work around the requirement that the points appear in the same order in both. Now the description only needs to contain the same points as the data source. Here is my final code:

# set options to get strings right
options(stringsAsFactors = FALSE)

# read in original data
DataOriginal <- read.csv("data.csv", sep = ";")
DataDescriptionOriginal <- read.csv("description.csv", sep = ";")

# sort the data
DataOrdered <- DataOriginal[, order(names(DataOriginal))]
DataDescriptionOrdered <- DataDescriptionOriginal[order(DataDescriptionOriginal$points), ]

# copy data into final dataframe and replace names
# (the same index is used on both sides, so the i-th description row must
# name the i-th column of Data; that is why both were sorted above)
Data <- DataOrdered
idx <- match(DataDescriptionOrdered$points, names(Data))
names(Data)[idx] <- gsub(" ", "_", DataDescriptionOrdered$description)[idx]
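
If sorting both tables is inconvenient, one variant that does not depend on order is to look up each column name in the description, rather than each point in the column names. This is only a sketch and not the code from the answer: it builds the sample data inline, assumes the column is called point as in the description.csv shown in the question, and leaves columns without a description row (such as date) untouched.

```r
# Sample data from the question, built inline so the sketch is self-contained
DataSource <- data.frame(point1 = c(25, 26), point2 = c(80, 70),
                         date = c("11.06.2013", "10.06.2013"),
                         stringsAsFactors = FALSE)

# Description rows deliberately in a different order than the data columns
DataDescription <- data.frame(point = c("point2", "point1"),
                              name = c("humidA", "tempA"),
                              stringsAsFactors = FALSE)

# For each column, find its row in the description (NA if it has none)
idx <- match(names(DataSource), DataDescription$point)
found <- !is.na(idx)

# Rename only the columns that have a description row
names(DataSource)[found] <- DataDescription$name[idx[found]]
names(DataSource)
```

Here the left-hand side is indexed by column position and the right-hand side by description row, so neither table has to be sorted first.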

Thanks a lot to everyone who helped me find a good solution!
