CSV nominal values' wrong recognition in R

Question

I have numbers declared as text (i.e. nominal) in MS Access. These numbers are shorthand codes for categories that would otherwise be long sentences.

So far I have tried exporting the file as CSV in three ways:

MS Access' native csv function
MS Excel's native csv function (from MS Access)
LibreOffice Calc's "Use Text CSV Format"; I even checked the "Quote all text cells" option to ensure that all nominal values are taken care of.

The problem shows up in R: when I run summary(data), these number-coded nominal values are still interpreted as numeric, even though the values are enclosed in double or single quotation marks. I am sure of this because summary() reports the mean, median, and other statistics for these variables, whereas the variables containing letters are shown with frequencies.

In the example below, both var1 and var2 are nominal, but var2 is coded with numbers (note that the var2 values shown have been changed for security).

var1            var2
Cat  : 111   Min.   :1   
Dog  : 222   1st Qu.:1   
Bee  : 333   Median :8   
Yog  : 555   Mean   :10   
Fig  : 999   3rd Qu.:1
Kol  : 444   Max.   :15                                      
(Other):2250

I've thought of appending a character to these number-coded values (instead of 1, 2, 3, 4, 5, I'd have 1a, 2a, 3a, 4a, 5a) to ensure that they are interpreted as nominal, but I am hoping for a better solution before resorting to that arduous task.

Answer 1

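read.table and its relatives (read.csv, read.delim, and so on) have a colClasses argument that lets you state the type of each column instead of letting R guess it.

The following examples show how the result changes with different colClasses values: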

Sample data

text <- c("A,B,C", "1,2,3", "2,1,4")

Default `read.csv`

A <- read.csv(text = text)
str(A)
# 'data.frame':  2 obs. of  3 variables:
#  $ A: int  1 2
#  $ B: int  2 1
#  $ C: int  3 4
summary(A)
#       A              B              C       
# Min.   :1.00   Min.   :1.00   Min.   :3.00  
# 1st Qu.:1.25   1st Qu.:1.25   1st Qu.:3.25  
# Median :1.50   Median :1.50   Median :3.50  
# Mean   :1.50   Mean   :1.50   Mean   :3.50  
# 3rd Qu.:1.75   3rd Qu.:1.75   3rd Qu.:3.75  
# Max.   :2.00   Max.   :2.00   Max.   :4.00
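
The same default explains why quoting the values in the exported file did not help in the question: read.csv strips the quotes while reading and only then guesses each column's type. A minimal sketch, using a small hypothetical sample (`quoted` is made up for illustration and is not the asker's data):

```r
# Hypothetical sample: every cell is wrapped in double quotes, similar to
# what a "quote all text cells" export would produce.
quoted <- c('"A","B"', '"1","x"', '"2","y"')

# No colClasses given, so read.csv guesses the type of each column
# after the quotes have been removed.
Q <- read.csv(text = quoted)

# Column A is still converted to a number despite the quotes.
class(Q$A)
# [1] "integer"
```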

Read data in as `character`

B <- read.csv(text = text, colClasses = "character")
str(B)
# 'data.frame': 2 obs. of  3 variables:
#  $ A: chr  "1" "2"
#  $ B: chr  "2" "1"
#  $ C: chr  "3" "4"
summary(B)
#     A                  B                  C            
# Length:2           Length:2           Length:2          
# Class :character   Class :character   Class :character  
# Mode  :character   Mode  :character   Mode  :character

Read data in as `factor`

C <- read.csv(text = text, colClasses = "factor")
str(C)
# 'data.frame': 2 obs. of  3 variables:
#  $ A: Factor w/ 2 levels "1","2": 1 2
#  $ B: Factor w/ 2 levels "1","2": 2 1
#  $ C: Factor w/ 2 levels "3","4": 1 2
summary(C)
#   A     B     C    
# 1:1   1:1   3:1  
# 2:1   2:1   4:1

The colClasses argument accepts a vector, so you can specify the class of each column individually:

D <- read.csv(text = text, colClasses = c("integer", "character", "factor"))

str(D)
# 'data.frame':  2 obs. of  3 variables:
#  $ A: int  1 2
#  $ B: chr  "2" "1"
#  $ C: Factor w/ 2 levels "3","4": 1 2
summary(D)
#        A             B             C    
#  Min.   :1.00   Length:2           3:1  
#  1st Qu.:1.25   Class :character   4:1  
#  Median :1.50   Mode  :character        
#  Mean   :1.50                           
#  3rd Qu.:1.75                           
#  Max.   :2.00
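
When only one or two columns need fixing, as in the question, colClasses can also be given as a named vector, so that only the named columns are forced and the rest are still guessed. The sketch below assumes a hypothetical file layout (`csv_lines` and the `count` column are invented for illustration; `var1` and `var2` follow the question's naming):

```r
# Hypothetical data standing in for the exported file: var2 holds numeric
# codes that are really category labels, count is a genuine number.
csv_lines <- c("var1,var2,count", "Cat,1,10", "Dog,8,20", "Bee,15,30")

# Named colClasses: only var2 is forced to factor; columns that are not
# named are still type-guessed by read.csv.
E <- read.csv(text = csv_lines, colClasses = c(var2 = "factor"))

class(E$var2)    # "factor"
class(E$count)   # "integer"

# summary() now reports frequencies for var2 instead of mean and median.
summary(E$var2)
```

For a data frame that is already loaded, `data$var2 <- factor(data$var2)` converts the column after the fact, which avoids having to append letters to the codes.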
