
R: converting a character matrix into a numeric data frame while preserving factors

I have matrices given in the following way:

m <- as.matrix(rbind(c("State", "Murder", "Assault", "UrbanPop", "Rape", "Group"),
c("Alabama", 13.2, 236, 58, 21.2, "A"),
c("Alaska", 10.0, 263, 48, 44.5, "A"),
c("Arizona", 8.1, 294, 80, 31.0, "A"),
c("Arkansas", 8.8, 190, 50, 19.5, "A"),
c("California", 9.0, 276, 91, 40.6, "A"),
c("Colorado", 7.9, 204, 78, 38.7, "A"),
c("Connecticut", 3.3, 110, 77, 11.1, "A"),
c("Delaware", 5.9, 238, 72, 15.8, "A"),
c("Florida", 15.4, 335, 80, 31.9, "A"),
c("Georgia", 17.4, 211, 60, 25.8, "A"),
c("Hawaii", 5.3, 46, 83, 20.2, "A"),
c("Idaho", 2.6, 120, 54, 14.2, "A"),
c("Illinois", 10.4, 249, 83, 24.0, "A"),
c("Indiana", 7.2, 113, 65, 21.0, "A"),
c("Iowa", 2.2, 56, 57, 11.3, "A"),
c("Kansas", 6.0, 115, 66, 18.0, "A"),
c("Kentucky", 9.7, 109, 52, 16.3, "A"),
c("Louisiana", 15.4, 249, 66, 22.2, "A"),
c("Maine", 2.1, 83, 51, 7.8, "B"),
c("Maryland", 11.3, 300, 67, 27.8, "B"),
c("Massachusetts", 4.4, 149, 85, 16.3, "B"),
c("Michigan", 12.1, 255, 74, 35.1, "B"),
c("Minnesota", 2.7, 72, 66, 14.9, "B"),
c("Mississippi", 16.1, 259, 44, 17.1, "B"),
c("Missouri", 9.0, 178, 70, 28.2, "B"),
c("Montana", 6.0, 109, 53, 16.4, "B"),
c("Nebraska", 4.3, 102, 62, 16.5, "C"),
c("Nevada", 12.2, 252, 81, 46.0, "C"),
c("New_Hampshire", 2.1, 57, 56, 9.5, "C"),
c("New_Jersey", 7.4, 159, 89, 18.8, "C"),
c("New_Mexico", 11.4, 285, 70, 32.1, "C"),
c("New_York", 11.1, 254, 86, 26.1, "C"),
c("North_Carolina", 13.0, 337, 45, 16.1, "C"),
c("North_Dakota", 0.8, 45, 44, 7.3, "C"),
c("Ohio", 7.3, 120, 75, 21.4, "D"),
c("Oklahoma", 6.6, 151, 68, 20.0, "D"),
c("Oregon", 4.9, 159, 67, 29.3, "D"),
c("Pennsylvania", 6.3, 106, 72, 14.9, "D"),
c("Rhode_Island", 3.4, 174, 87, 8.3, "D"),
c("South_Carolina", 14.4, 279, 48, 22.5, "D"),
c("South_Dakota", 3.8, 86, 45, 12.8, "D"),
c("Tennessee", 13.2, 188, 59, 26.9, "D"),
c("Texas", 12.7, 201, 80, 25.5, "D"),
c("Utah", 3.2, 120, 80, 22.9, "D"),
c("Vermont", 2.2, 48, 32, 11.2, "D"),
c("Virginia", 8.5, 156, 63, 20.7, "D"),
c("Washington", 4.0, 145, 73, 26.2, "D"),
c("West_Virginia", 5.7, 81, 39, 9.3, "D"),
c("Wisconsin", 2.6, 53, 66, 10.8, "D"),
c("Wyoming", 6.8, 161, 60, 15.6, "D")))

I need to convert this into a data.frame (or table) while preserving the column and row names and keeping the numbers numeric, and convert anything else (in this example, the column 'Group') into factors. (The data aren't always in this format, so the code has to be general.)

(An optional step is then to remove one column by a given name; that's the reason for using a data.frame, as it makes this very easy.)

Then the resulting data.frame (or table, or matrix) is passed to the 'scale' function.

My solution consists of several steps:

data <- m[-1,-1]
colnames(data) <- m[1,-1]
rownames(data) <- m[-1,1][m[-1,1]!='']
data <- as.data.frame(data)

Now I have a data.frame, but it cannot be passed to the scale() function ("Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"). If I use data.matrix(data), the factors are converted to integers just fine, but all the doubles are converted to integers too. I have been stuck on this for hours.

Thank you in advance

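I'll move this to an answer, as it doesn't seem to work via comments. You can do the following (as.is = FALSE is passed explicitly because, in R 4.0.0 and later, type.convert otherwise warns and leaves strings as character instead of converting them to factors):

data <- data.frame(lapply(data.frame(m[-1,-1], stringsAsFactors = FALSE), type.convert, as.is = FALSE))

This converts all the columns of the matrix to the correct types: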

str(data)
# 'data.frame':  50 obs. of  5 variables:
# $ X1: num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
# $ X2: int  236 263 294 190 276 204 110 238 335 211 ...
# $ X3: int  58 48 80 50 91 78 77 72 80 60 ...
# $ X4: num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
# $ X5: Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...

Then, you can set your column/row names as you wish

colnames(data) <- m[1,-1]
rownames(data) <- m[-1,1][m[-1,1]!='']

For scale you can do

scale(data[-5])

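Edit per the OP's comment.

As I already said several times, using data.matrix on factors is simply wrong and will completely mess up your data, because it returns the internal integer codes of the factor levels rather than the values they represent. Consider the following example: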

data.matrix(data.frame(A = factor(c("A", "B")),
                       B = factor(10:11),
                       C = factor(c("22-11-2014", "23-11-2014"))))
#      A B C
# [1,] 1 1 1
# [2,] 2 2 2

data.matrix returned identical results for these completely different values.
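
As a small illustration of why the level codes are not the values, here is a sketch of the usual way to recover numbers from a factor whose levels look numeric. The vector f is made up for this example and does not come from the question's data.

```r
# A factor whose labels happen to look like numbers (illustrative data)
f <- factor(c("10.5", "11.2", "10.5"))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(f)

# Going through the character labels recovers the actual numbers
values <- as.numeric(as.character(f))

print(codes)   # level codes
print(values)  # original numbers
```

This is also why data.matrix appeared to turn the doubles into integers in the question: the columns were factors at that point, so only their codes survived.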

Now back to your real data: if you want to avoid running scale on factors and you don't know a priori which columns are factors, you can simply create an index that identifies the numeric columns and then run scale only on those, for example

indx <- sapply(data, is.numeric)
scale(data[indx])
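
If the scaled values should stay in the same data frame next to the factor columns, one possible variation is to assign the result back into the numeric columns only. This is a sketch on a tiny made-up data frame (df and scaled are illustrative names), not part of the original answer.

```r
# Tiny illustrative data frame with numeric, integer and factor columns
df <- data.frame(Murder  = c(13.2, 10.0, 8.1),
                 Assault = c(236L, 263L, 294L),
                 Group   = factor(c("A", "A", "B")),
                 row.names = c("Alabama", "Alaska", "Arizona"))

# Logical index of the numeric columns (integer columns count as numeric)
indx <- sapply(df, is.numeric)

# Replace only the numeric columns with their centred and scaled versions
scaled <- df
scaled[indx] <- as.data.frame(scale(df[indx]))

str(scaled)
```

The factor column and the row names are left untouched, so the result can still be subset by state name or grouped by Group afterwards.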

Read it as a data.frame and set the names afterwards:

m = data.frame(rbind.... your data here as above)

rownames(m) = m$X1 
colnames(m) = c(t(m[1,]))
req.df  = m[-1,-1]
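
Note that the steps above only fix the names; the columns of req.df are still text. A minimal sketch of one way to finish the job, using a shortened made-up version of the data and assuming type.convert with as.is = FALSE is acceptable for deciding the column types:

```r
# Shortened illustrative version of the question's data
m <- data.frame(rbind(c("State", "Murder", "Assault", "Group"),
                      c("Alabama", 13.2, 236, "A"),
                      c("Alaska", 10.0, 263, "B")),
                stringsAsFactors = FALSE)

rownames(m) <- m$X1            # first column holds the row names
colnames(m) <- c(t(m[1, ]))    # first row holds the column names
req.df <- m[-1, -1]            # drop the header row and the name column

# Convert each text column to numeric, integer or factor as appropriate
req.df[] <- lapply(req.df, type.convert, as.is = FALSE)

str(req.df)
```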

Below is a quick attempt that preserves numeric and factor types.

# convert into data frame
df <- as.data.frame(m[2:nrow(m), 2:ncol(m)], stringsAsFactors = FALSE)
# set names
names(df) <- m[1, 2:ncol(m)]    
rownames(df) <- m[2:nrow(m), 1]
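# convert types: numeric if every non-missing value parses as a number, factor otherwise
# (checking the whole column rather than only its first element)
df[] <- lapply(df, function(x) {
  num <- suppressWarnings(as.numeric(x))
  if (any(is.na(num) & !is.na(x))) as.factor(x) else num
})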

str(df)
'data.frame':   50 obs. of  5 variables:
 $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault : num  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop: num  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
 $ Group   : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
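
To tie this back to the question's optional step (removing a column by name before calling scale), the conversion can be wrapped in a small function. The sketch below is only one possible arrangement: matrix_to_df and its drop_cols argument are hypothetical names introduced here, and the input is a shortened made-up matrix in the same layout as the question's.

```r
# Hypothetical helper: first row = column names, first column = row names
matrix_to_df <- function(m, drop_cols = NULL) {
  df <- as.data.frame(m[-1, -1, drop = FALSE], stringsAsFactors = FALSE)
  names(df) <- m[1, -1]
  rownames(df) <- m[-1, 1]
  # numbers become numeric/integer, everything else becomes a factor
  df[] <- lapply(df, type.convert, as.is = FALSE)
  # optionally remove columns by name
  df[setdiff(names(df), drop_cols)]
}

# Shortened illustrative input
m <- rbind(c("State",   "Murder", "Assault", "Group"),
           c("Alabama", "13.2",   "236",     "A"),
           c("Alaska",  "10.0",   "263",     "A"),
           c("Maine",   "2.1",    "83",      "B"))

df <- matrix_to_df(m, drop_cols = "Assault")

# Scale only the numeric columns that remain
scaled <- scale(df[sapply(df, is.numeric)])
print(scaled)
```

Selecting columns with setdiff keeps the row names and simply ignores names that are not present, which seemed a reasonable default for data whose layout varies.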
