简体   繁体   中英

Why does chisq.test sort data in descending order before summation

Why does chisq.test function in R sorts data before summation in descending order?

The code in question is:

STATISTIC <- sum(sort((x - E)^2/E, decreasing = TRUE))

If I was worried about numerical stability due to usage of float arithmetic and wanted to use some easy to deploy hack, I would sort the data in increasing order before summation to avoid adding a tiny value to large value in the accumulator (in order to avoid trimming of the least significant bits in the result as much as possible).

I looked into the source code of sum but it did not explain why to pass data in descending order to sum() . What do I miss?

An example:

x = matrix(1.1, 10001, 1)
x[1] = 10^16   # We have a vector with 10000*1.1 and 1*10^16
c(sum(sort(x, decreasing = TRUE)), sum(sort(x, decreasing = FALSE)))

The result:

10000000000010996 10000000000011000

When we sort the data in the ascending order, we get the correct result. If we sort the data in the descending order, we get a result that is off by 4.

EDIT: The book " Accuracy and stability of numerical algorithms by Nicolas J. Higham " states that

"when summing nonnegative numbers by recursive summation the increasing ordering is the best ordering, in the sense of having the smallest a priori forward error bound."

Thanks @Lamia for sharing the book in the comment section.

This books explains three methods of summation such as recursive, insertion and pairwise techniques. Each technique has its own merits and demerits based on the magnitude of the error bound associated with them, which can be computed by doing systematic error analysis of summation of floating point numbers.

Notably, the summation results from recursive technique is dependent on the ordering strategies such as increasing, decreasing and Psum (look in the book - page 82 - 4th paragraph. also see how it works in the example given at the bottom of page 82.).

Looking at the R source code for the sum() function which can be obtained from summary.c informs that R implements recursive method in its sum() function.

Also the number of base digits in the floating-point significand is 53, which can be obtained from

.Machine$double.digits
# [1] 53

By setting this number as the precision bits, we can compare the accuracy of sum operation by base R and mpfr() from Rmpfr library for different ordering strategies. Notice that increasing order produces results closer to the ones seen in floating point aware summations, which corroborates the above statement from this book.

I computed chi square statistic using the raw data x .

library('data.table')
library('Rmpfr')
x1 = matrix(c( 10^16, rep(1.1, 10000)), 
            nrow = 10001, ncol = 1)
df1 <- data.frame(x = x1)
setDT(df1)
df1[, p := rep(1/length(x), length(x))]
s_x <- df1[, sum(x)]
df1[, E := s_x * p]
df1[, chi := ((x - E)^2/E)]

precBits <- .Machine$double.digits
x_chi <- data.frame( names = c("x_asc", "x_desc", "x_fp_asc", "x_fp_desc",
                               "chi_asc", "chi_desc", "chi_fp_asc", "chi_fp_desc"))
x_chi$vals <- c( ## x
  df1[order(x), format( sum(x), digits = 22)],
  df1[order(-x), format( sum(x), digits = 22)],
  df1[order(x), format( sum(mpfr(x, precBits = precBits)), digits = 22)],
  df1[order(-x), format( sum(mpfr(x, precBits = precBits)), digits = 22)],
  ## chi
  df1[order(chi), format( sum(chi), digits = 22)],
  df1[order(-chi), format( sum(chi), digits = 22)],
  df1[order(chi), format( sum(mpfr(chi, precBits = precBits)), digits = 22)],
  df1[order(-chi), format( sum(mpfr(chi, precBits = precBits)), digits = 22)])

x_chi
#         names                    vals
# 1       x_asc       10000000000011000
# 2      x_desc       10000000000010996
# 3    x_fp_asc 10000000000011000.00000
# 4   x_fp_desc 10000000000020000.00000
# 5     chi_asc    99999999999890014218
# 6    chi_desc    99999999999890030592
# 7  chi_fp_asc 99999999999890014208.00
# 8 chi_fp_desc 99999999999833554944.00

Looking at the source code of edit(chisq.test) function informs that there is no sorting operation involved in it.

Also, as pointed out in the comment section, it is not related to the sign (+ve or -ve) of values of raw data used in chisq.test() function. This function does not accept negative values, so it will spit error by halting the function with this message "all entries of 'x' must be nonnegative and finite" .

set.seed(2L)
chisq.test(c(rnorm(10, 0, 1)))
# Error in chisq.test(c(rnorm(10, 0, 1))) : 
#   all entries of 'x' must be nonnegative and finite

The difference in the values while summing floating point numbers is concerned with the double precision arithmetic. See the demo below. When the precision of floating point numbers is maintained at 200 digits using the mpfr() function available in Rmpfr package, the sum operation gives identical results regardless of the order of the vector x1 or x2 . However, unequal values are observed when no floating point precision is maintained.

No FP Precision:

x1 = matrix(c( 10^16, rep(1.1, 10000)), 
            nrow = 10001, ncol = 1)
## reverse
x2 = matrix(c( rep(1.1, 10000), 10^16 ), 
            nrow = 10001, ncol = 1)

c( format(sum(x1), digits = 22), 
   format(sum(x2), digits = 22))
# [1] "10000000000010996" "10000000000011000"

FP Precision maintained:

library('Rmpfr')
##
prec <- 200
x1 = matrix(c( mpfr( 10^16, precBits = prec),
              rep( mpfr(1.1, precBits = prec), 10000)), 
           nrow = 10001, ncol = 1)

## reverse
x2 = matrix(c( rep(mpfr(1.1, precBits = prec), 10000), 
              mpfr( 10^16, precBits = prec) ), 
           nrow = 10001, ncol = 1)
c( sum(x1), sum(x2))
# 2 'mpfr' numbers of precision  200   bits 
# [1] 10000000000011000.000000000000888178419700125232338905334472656
# [2] 10000000000011000.000000000000888178419700125232338905334472656

The smallest positive floating point number in base R can be obtained from the code below and any number smaller than this number will be truncated in base R, which produces varying results out of sum operation.

.Machine$double.eps
# [1] 2.220446e-16

Comparison of double precision arithmetic aware and unaware functions of chisq.test() function.

The relevant portion of chisq.test() is extracted and a new function chisq.test2() is made with it. Inside, you will see options for comparison before and after applying 250 digits double precision awareness using mpfr() function to chi square statistic. You can see identical results are seen for floating point aware function, but not for the raw data.

# modified chi square function:
chisq.test2 <- function (x, precBits) 
{
  if (is.matrix(x)) {
    if (min(dim(x)) == 1L) 
      x <- as.vector(x)
  }

  #before fp precision
  p = rep(1/length(x), length(x))
  n <- sum(x)
  E <- n * p

  # after fp precision
  x1 <- mpfr(x, precBits = precBits)
  p1 = rep(1/length(x1), length(x1))
  n1 <- sum(x1)
  E1 <- n1 * p1

  # chisquare statistic
  STATISTIC <- c(format(sum((x - E)^2/E), digits=22),           # before decreasing
                 format(sum(sort((x - E)^2/E, decreasing = FALSE)), digits=22), # before increasing
                 sum((x1 - E1)^2/E1),                           # after decreasing 
                 sum(sort((x1 - E1)^2/E1, decreasing = FALSE))) # after increasing

  return(STATISTIC)
}

# data
x1 = matrix(c( 10^16, rep(1.1, 10000)), 
            nrow = 10001, ncol = 1)

chisq.test2(x = x1, precBits=250)

Output:

# [[1]]  # before fp decreasing
# [1] "99999999999890030592"
# 
# [[2]]  # before fp increasing
# [1] "99999999999890014218"
# 
# [[3]]  # after fp decreasing 
# 'mpfr1' 99999999999889972569.502489584522352514811399898444554440067408531548230046685
# 
# [[4]]  # after fp increasing
# 'mpfr1' 99999999999889972569.502489584522352514811399898444554440067408531548230251906

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM