char * versus unsigned char * and casting

Question

I need to use the SQLite function sqlite3_prepare_v2() ( https://www.sqlite.org/c3ref/prepare.html ).

This function takes a const char * as its second parameter.

On the other hand, I have prepared an unsigned char * variable v which contains something like this:

INSERT INTO t (c) VALUES ('amitié')

In hexadecimal representation (I cut the line):

49 4E 53 45 52 54 20 49 4E 54 4F 20 74 20 28 63 29
20 56 41 4C 55 45 53 20 28 27 61 6D 69 74 69 E9 27 29

Note the 0xE9 representing the character é .

In order for this piece of code to be built properly, I cast the variable v with (const char *) when I pass it, as an argument, to the sqlite3_prepare_v2() function...

What comments can you make about this cast? Is it really very very bad?

Note that I have been using an unsigned char * pointer to be able to store characters between 0x00 and 0xFF with one byte only.

The source data is coming from an ANSI encoded file.

In the documentation for the sqlite3_prepare_v2() function, I'm also reading the following comment for the second argument of this function:

/* SQL statement, UTF-8 encoded */

What troubles me is the type const char * for the function's second argument... I would have expected a const unsigned char * instead...

To me - but then again I might be totally wrong - there are only 7 useful bits in a char (one byte), the most significant bit (leftmost) being used to denote the sign of the byte...

I guess I'm missing some kind of point here...

Thank you for helping.

Answer 1

You are correct.

For UTF-8 input, sqlite3_prepare_v2() arguably should be asking for a const unsigned char *, since all 8 bits of each byte are used for data. Its implementation certainly shouldn't rely on a signed comparison to check the top bit: whether plain char is signed or unsigned is implementation-defined, and many compilers let you change it with a flag (for example -funsigned-char in GCC), so a test such as c < 0 would silently stop working when char is unsigned.
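
To illustrate the point about checking the top bit, here is a minimal sketch of a test that does not depend on whether plain char is signed. The helper names has_top_bit and count_non_ascii are made up for this example; they are not part of SQLite.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero when the byte has its top bit set.
   Converting to unsigned char first gives the same result whether
   plain char is signed or unsigned on this compiler. */
static int has_top_bit(char c)
{
    return ((unsigned char)c & 0x80u) != 0;
}

/* Hypothetical helper: counts the bytes outside 7-bit ASCII in a
   NUL-terminated string. */
static size_t count_non_ascii(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (has_top_bit(*s)) {
            ++n;
        }
    }
    return n;
}
```

A comparison like c < 0 would only give the same answer on compilers where char is signed, which is why the mask on an unsigned char is the safer habit.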

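As for your concerns over the cast, this is one of the more benign ones. Converting an unsigned char * to a const char * doesn't change the bytes being pointed to, and C allows any object to be accessed through a character type. Casting away signedness on integer values such as int is usually a very bad thing (TM), or at least a clear indicator that you have a problem.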
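
As a rough sketch of why the pointer cast is harmless, the code below passes an unsigned char buffer to a function that takes const char *. Both functions are hypothetical stand-ins written for this example (api_taking_char only plays the role of something like the SQL text parameter of sqlite3_prepare_v2()); keeping the cast inside one small wrapper is a design choice, not something SQLite requires.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an API that takes const char *.
   It returns the byte length before the terminating NUL. */
static size_t api_taking_char(const char *sql)
{
    assert(sql != NULL);
    return strlen(sql);
}

/* Hypothetical wrapper: the only place where the cast appears.
   The cast changes the pointer's type, not the bytes it points to. */
static size_t pass_unsigned_buffer(const unsigned char *v)
{
    return api_taking_char((const char *)v);
}
```

Reading the bytes back through an unsigned char * after the cast yields the original values, 0xE9 included.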

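When dealing with pure ASCII, you are correct that there are only 7 bits of data, but the 8th bit was historically used as a parity bit, not as a sign bit. A char still stores a full byte either way; signedness only affects how the top bit is interpreted when the value is converted to an integer.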
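
One related point: a lone 0xE9 byte is not valid UTF-8, where é is encoded as the two bytes 0xC3 0xA9. Since the documentation quoted in the question asks for UTF-8, the ANSI data would probably need converting before it is passed on. Below is a minimal sketch of such a conversion. It assumes the file is ISO-8859-1 (Latin-1) compatible, which the question does not confirm; Windows-1252 differs for bytes 0x80 to 0x9F and would need a lookup table. The function name latin1_to_utf8 is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper. Converts NUL-terminated ISO-8859-1 bytes to UTF-8.
   ASSUMPTION: the "ANSI" input is Latin-1 compatible; Windows-1252 bytes
   0x80..0x9F are NOT handled correctly by this sketch.
   Returns the number of bytes written, excluding the terminating NUL,
   or (size_t)-1 if dst is too small. */
static size_t latin1_to_utf8(const unsigned char *src, char *dst,
                             size_t dst_size)
{
    size_t out = 0;
    assert(src != NULL && dst != NULL);
    for (; *src != 0; ++src) {
        if (*src < 0x80) {
            /* ASCII byte: copied unchanged, leaving room for the NUL. */
            if (out + 1 >= dst_size) {
                return (size_t)-1;
            }
            dst[out++] = (char)*src;
        } else {
            /* 0x80..0xFF map to U+0080..U+00FF: two UTF-8 bytes. */
            if (out + 2 >= dst_size) {
                return (size_t)-1;
            }
            dst[out++] = (char)(0xC0 | (*src >> 6));
            dst[out++] = (char)(0x80 | (*src & 0x3F));
        }
    }
    if (out >= dst_size) {
        return (size_t)-1;
    }
    dst[out] = '\0';
    return out;
}
```

With this, the byte 0xE9 from the question becomes 0xC3 0xA9, and the resulting char buffer can be handed to a function expecting UTF-8 text without any cast.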

solution1
1 2015-04-07 14:00:46
