
Aggregate columns on a wildcard in R

I am looking at real estate data that records the number of days a home is listed. In the data ( link ), there are columns that denote year and month in the form YYYY.MM. When I import this data into R, the columns keep the same names but with an 'X' in front (XYYYY.MM). Ideally, I would like to get the median number of days a home is listed for each year in the data. For example, for 2010 I would take the median across the columns 2010.01 through 2010.12 and store it in a variable named '2010.median.days.listed', and likewise for every other year. Is there a good way of doing this in R?

You could try the following code:

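library(reshape2)  # provides melt()
dta <- read.csv("http://files.zillowstatic.com/research/public/State/DaysOnZillow_Public_State.csv")
# Reshape to long format, treating the first five columns as identifiers.
dta <- melt(dta, id.vars = 1:5)
# Characters 2-5 of a name such as "X2010.01" hold the year.
dta$year <- substr(dta$variable, 2, 5)

# Mean of all monthly values within each year.
dta_results <- aggregate(dta$value, FUN = mean, list(dta$year))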

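First you reshape the data to long format, then you extract the year (or any other grouping you want, such as year + state), and finally you aggregate to get a table of means, sums, or any other summary for any combination of grouping factors. For the median asked about in the question, pass FUN = median instead of FUN = mean. In the output below, the NA for 2015 most likely arises because mean() returns NA when a group contains missing values; adding na.rm = TRUE to the aggregate() call skips them.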

> head(dta_results)
  Group.1        x
1    2010 128.0370
2    2011 126.1191
3    2012 122.5372
4    2013 109.1042
5    2014 102.4921
6    2015       NA
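To make the long-format idea concrete for the median, here is a minimal sketch grouped by both state and year. It assumes a small made-up table (`wide`, with invented values) in place of the Zillow file, and it builds the long table with base R rather than melt() so that it runs without extra packages; the names `wide`, `long`, `month_cols` and `medians` are only illustrative.

```r
# Toy wide table standing in for the Zillow file (values are made up).
wide <- data.frame(
  RegionName = c("Alabama", "Alaska"),
  X2010.01 = c(100, 130),
  X2010.02 = c(110, NA),
  X2010.03 = c(120, 150),
  X2011.01 = c(90, 140),
  X2011.02 = c(80, 160),
  stringsAsFactors = FALSE
)

# Names of the monthly columns, i.e. those of the form XYYYY.MM.
month_cols <- grep("^X[0-9]{4}\\.[0-9]{2}$", names(wide), value = TRUE)

# Base-R stand-in for melt(): one row per (state, month) pair.
long <- data.frame(
  RegionName = rep(wide$RegionName, times = length(month_cols)),
  variable   = rep(month_cols, each = nrow(wide)),
  value      = unlist(wide[month_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long$year <- substr(long$variable, 2, 5)

# Median per state and year; na.rm = TRUE is passed through to median().
medians <- aggregate(
  long$value,
  by = list(RegionName = long$RegionName, year = long$year),
  FUN = median,
  na.rm = TRUE
)
```

The result has one row per state and year, with the median in the column `x`. On the real data the same aggregate() call should work on the melted `dta`, provided the state column is used as the first grouping variable.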

There are almost certainly more elegant ways to do this, but as a quick fix you can subset all the columns representing a given year using R's grepl function, e.g.:

dataURL = "http://files.zillowstatic.com/research/public/State/DaysOnZillow_Public_State.csv"
data = read.csv(dataURL)

year = 2010

# Select the columns of the data whose column name contains the pattern
# given in the variable "year", here "2010".
cols = data[, grepl(year, names(data))]

I am assuming you want a median value for each row from among these 12 columns (e.g. the second row of your desired "2010.median.days.listed" column would contain the median of the 12 "Alaska" values from 2010). Is that correct?

If so, you can then use apply: apply(cols, 1, median). This takes the function median and applies it to each row of cols. The second argument (1) indicates that we wish to apply the function row-wise.
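The same row-wise idea can be repeated for every year in one loop. The following is a minimal sketch, assuming a small made-up table (`wide`, with invented values) in place of the Zillow file; `wide`, `result`, `month_cols` and `years` are illustrative names, and na.rm = TRUE is an added assumption so that missing months do not turn a whole row into NA.

```r
# Toy wide table standing in for the Zillow file (values are made up).
wide <- data.frame(
  RegionName = c("Alabama", "Alaska"),
  X2010.01 = c(100, 130),
  X2010.02 = c(110, NA),
  X2010.03 = c(120, 150),
  X2011.01 = c(90, 140),
  X2011.02 = c(80, 160),
  stringsAsFactors = FALSE
)

# Monthly columns (XYYYY.MM) and the distinct years they cover.
month_cols <- grep("^X[0-9]{4}\\.[0-9]{2}$", names(wide), value = TRUE)
years <- unique(substr(month_cols, 2, 5))

# Start from the state names and add one median column per year.
result <- wide["RegionName"]
for (yr in years) {
  yr_cols <- month_cols[substr(month_cols, 2, 5) == yr]
  result[[paste0(yr, ".median.days.listed")]] <-
    apply(wide[, yr_cols, drop = FALSE], 1, median, na.rm = TRUE)
}
```

Because the new column names start with a digit, they are not syntactic R names, so they have to be accessed as result[["2010.median.days.listed"]] or with backticks after `$`.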
