
Convert dummy variable from numeric to factor for chi-square test in R

I want to perform a chi-square test in R on the dataset below. After creating dummy variables, the p-value I get from the chi-square test is 1, which is incorrect. I suspect this is because dummy variable creation changes the data structure from factor to numeric. This is a hypothesis-testing question: I want to check whether the defective percentage varies across the 4 country centers at the 5% significance level. Please advise what the possible error is and how to fix it.

Subset of the dataset used:

Phillippines  Indonesia   Malta       India
Error Free    Error Free  Defective   Error Free
Error Free    Error Free  Error Free  Defective
Error Free    Defective   Defective   Error Free
Error Free    Error Free  Error Free  Error Free
Error Free    Error Free  Defective   Error Free
Error Free    Error Free  Error Free  Error Free

The structure of the initial data is factor:

> str(data)
'data.frame':   300 obs. of  4 variables:
 $ Phillippines: Factor w/ 2 levels "Defective","Error Free": 2 2 2 2 2 2 2 2 2 2 ...
 $ Indonesia   : Factor w/ 2 levels "Defective","Error Free": 2 2 1 2 2 2 1 2 2 2 ...
 $ Malta       : Factor w/ 2 levels "Defective","Error Free": 1 2 1 2 1 2 2 2 2 2 ...
 $ India       : Factor w/ 2 levels "Defective","Error Free": 2 1 2 2 2 2 2 2 2 2 …

I created dummy variables for the categorical data (Error Free and Defective) with the following code:

library(caret)
dmy <- dummyVars("~ .", data = data, fullRank = T)
trsf <- data.frame(predict(dmy, newdata = data))

After dummy variable creation, the columns become numeric:

> str(trsf)
'data.frame':   300 obs. of  4 variables:
 $ Phillippines.Error.Free: num  1 1 1 1 1 1 1 1 1 1 ...
 $ Indonesia.Error.Free   : num  1 1 0 1 1 1 0 1 1 1 ...
 $ Malta.Error.Free       : num  0 1 0 1 0 1 1 1 1 1 ...
 $ India.Error.Free       : num  1 0 1 1 1 1 1 1 1 1 ...

The p-value of the chi-square test is 1:

> chisq.test(trsf)   

    Pearson's Chi-squared test

data:  trsf
X-squared = 112.75, df = 897, p-value = 1

Warning message:
In chisq.test(trsf) : Chi-squared approximation may be incorrect

I tried applying as.factor and running the chi-square test again, but got the following error:

trsf_2 <- as.factor(trsf)
str(trsf_2)
 Factor w/ 4 levels "c(1, 1, 1, 1, 1, 0, 0, 0, 0, 1)",..: NA NA NA NA
 - attr(*, "names")= chr [1:4] "Phillippines.Error.Free" "Indonesia.Error.Free" "Malta.Error.Free" "India.Error.Free"

> chisq.test(trsf_2)   
Error in chisq.test(trsf_2) : 
  all entries of 'x' must be nonnegative and finite
In addition: Warning message:
In Ops.factor(x, 0) : ‘<’ not meaningful for factors

You could try

dataset <- as.data.frame(lapply(data, as.numeric))
chisq.test(dataset)
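
One thing worth checking: chisq.test treats a matrix or data frame as a contingency table of counts, so a 300 x 4 input is read as 300 categories by 4 countries, which would explain df = 897 = (300 - 1) x (4 - 1) in the output above. A minimal sketch of an alternative is to tabulate the counts per country first and test that table. The data below is simulated as a stand-in, because the full dataset is not shown; the 10% defect rate and the seed are assumptions for illustration only.

```r
# Hypothetical stand-in for the 300-row data frame (NOT the real data).
set.seed(1)
countries <- c("Phillippines", "Indonesia", "Malta", "India")
data <- as.data.frame(lapply(setNames(countries, countries), function(cn) {
  factor(sample(c("Defective", "Error Free"), 300, replace = TRUE,
                prob = c(0.1, 0.9)),
         levels = c("Defective", "Error Free"))
}))

# Count each outcome per country: a 2 x 4 matrix of counts.
counts <- vapply(data, function(col) as.integer(table(col)), integer(2))
rownames(counts) <- levels(data[[1]])
counts

# Test whether the defective proportion is the same across the four countries.
result <- chisq.test(counts)
result$parameter  # degrees of freedom
result$p.value
```

With a 2 x 4 table the test has (2 - 1) x (4 - 1) = 3 degrees of freedom. With the real data frame, only the last two steps are needed.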

However, I am not sure that chi-square is the most appropriate method for binary variables. May I suggest the phi coefficient? You can find more information here: https://en.wikipedia.org/wiki/Phi_coefficient

However, you will need to write a loop if you do not want to compute it manually for each pair of variables (i.e. countries).
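
A rough sketch of such a loop follows. It relies on the fact that, for two 0/1 variables, the phi coefficient equals the Pearson correlation, so cor can be used directly. The data is again simulated as a stand-in (the 10% defect rate and the seed are assumptions), and the names defective and country_pairs are made up for this example. Note that phi measures row-by-row association between two columns, which may not be the same question as whether defect rates differ between countries.

```r
# Hypothetical stand-in for the 300-row data frame (NOT the real data).
set.seed(1)
countries <- c("Phillippines", "Indonesia", "Malta", "India")
data <- as.data.frame(lapply(setNames(countries, countries), function(cn) {
  factor(sample(c("Defective", "Error Free"), 300, replace = TRUE,
                prob = c(0.1, 0.9)),
         levels = c("Defective", "Error Free"))
}))

# Recode each column as 1 = Defective, 0 = Error Free.
defective <- as.data.frame(lapply(data, function(col) {
  as.integer(col == "Defective")
}))

# All pairs of countries: a 2 x 6 character matrix, one pair per column.
country_pairs <- combn(names(defective), 2)

# Phi for each pair, computed as the Pearson correlation of the 0/1 columns.
phi <- apply(country_pairs, 2, function(p) {
  cor(defective[[p[1]]], defective[[p[2]]])
})
names(phi) <- apply(country_pairs, 2, paste, collapse = " vs ")
phi
```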


 