Nominal values in a CSV are wrongly recognized as numeric in R

I have numbers declared as text (i.e. nominal) in MS Access. These numbers are short codes standing in for categories that would otherwise be long sentences.

So far I have tried exporting the file as CSV in three ways:

  1. MS Access' native csv function
  2. MS Excel's native csv function (from MS Access)
  3. LibreOffice Calc's "Use Text CSV Format"; I even checked the "Quote all text cells" option to ensure that all nominal values are taken care of.

The problem appears in R: when I run summary(data), these number-coded nominal values are still interpreted as numeric, even though the values are enclosed in double or single quotation marks. I am sure of this because summary() reports the mean, median, and other statistics for these variables, whereas the variables containing letters are shown with frequencies.

In the example below, both var1 and var2 are nominal, but var2 is coded with numbers (note that the var2 values shown have been changed for security).

var1            var2
Cat  : 111   Min.   :1   
Dog  : 222   1st Qu.:1   
Bee  : 333   Median :8   
Yog  : 555   Mean   :10   
Fig  : 999   3rd Qu.:1
Kol  : 444   Max.   :15                                      
(Other):2250

I've thought of appending a character to these number-coded nominal values (instead of 1, 2, 3, 4, 5, I'd have 1a, 2a, 3a, 4a, 5a) to ensure that they are interpreted as nominal, but I am hoping for a better solution before taking on that arduous task.

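read.table and family (read.csv, read.delim, and so on) have a colClasses argument. Quoting the values in the file does not help, because by default the quotes are stripped before the column types are guessed, so a column of values like "42" still becomes an integer column.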

The following examples show how the results differ with different colClasses values:

Sample data

text <- c("A,B,C", "1,2,3", "2,1,4")

Default read.csv

A <- read.csv(text = text)
str(A)
# 'data.frame':  2 obs. of  3 variables:
#  $ A: int  1 2
#  $ B: int  2 1
#  $ C: int  3 4
summary(A)
#       A              B              C       
# Min.   :1.00   Min.   :1.00   Min.   :3.00  
# 1st Qu.:1.25   1st Qu.:1.25   1st Qu.:3.25  
# Median :1.50   Median :1.50   Median :3.50  
# Mean   :1.50   Mean   :1.50   Mean   :3.50  
# 3rd Qu.:1.75   3rd Qu.:1.75   3rd Qu.:3.75  
# Max.   :2.00   Max.   :2.00   Max.   :4.00  

Read data in as character

B <- read.csv(text = text, colClasses = "character")
str(B)
# 'data.frame': 2 obs. of  3 variables:
#  $ A: chr  "1" "2"
#  $ B: chr  "2" "1"
#  $ C: chr  "3" "4"
summary(B)
#     A                  B                  C            
# Length:2           Length:2           Length:2          
# Class :character   Class :character   Class :character  
# Mode  :character   Mode  :character   Mode  :character  

Read data in as factor

C <- read.csv(text = text, colClasses = "factor")
str(C)
# 'data.frame': 2 obs. of  3 variables:
#  $ A: Factor w/ 2 levels "1","2": 1 2
#  $ B: Factor w/ 2 levels "1","2": 2 1
#  $ C: Factor w/ 2 levels "3","4": 1 2
summary(C)
#   A     B     C    
# 1:1   1:1   3:1  
# 2:1   2:1   4:1

The colClasses argument accepts a vector, so you can specify on a column-by-column basis what the values should be:

D <- read.csv(text = text, colClasses = c("integer", "character", "factor"))

str(D)
# 'data.frame':  2 obs. of  3 variables:
#  $ A: int  1 2
#  $ B: chr  "2" "1"
#  $ C: Factor w/ 2 levels "3","4": 1 2
summary(D)
#        A             B             C    
#  Min.   :1.00   Length:2           3:1  
#  1st Qu.:1.25   Class :character   4:1  
#  Median :1.50   Mode  :character        
#  Mean   :1.50                           
#  3rd Qu.:1.75                           
#  Max.   :2.00                           
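
To tie this back to the quoted export described in the question, here is a minimal sketch. The data and the names csv_lines, quoted_default, quoted_fixed and converted are made up for illustration; only var1 and var2 come from the question. It relies on colClasses also accepting a named vector that is shorter than the number of columns, in which case only the named columns are set and the rest are detected as usual.

```r
# Hypothetical data mimicking the question: var2 holds numeric codes,
# quoted the way a "Quote all text cells" export would write them.
csv_lines <- c('"var1","var2"',
               '"Cat","1"',
               '"Dog","8"',
               '"Cat","15"')

# Default read: quotes are stripped before type detection,
# so var2 still comes in as integer.
quoted_default <- read.csv(text = csv_lines)
class(quoted_default$var2)

# Named colClasses: only var2 is forced to factor;
# var1 is left to the default type detection.
quoted_fixed <- read.csv(text = csv_lines, colClasses = c(var2 = "factor"))
class(quoted_fixed$var2)
summary(quoted_fixed$var2)

# Alternative: convert an already-loaded column after the fact.
converted <- factor(quoted_default$var2)
```

Naming the columns is handy when the file has many columns and only a few of them are number-coded categories; the after-the-fact conversion is probably the simplest option if the data are already loaded.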
