
Implementing a Character Encoding in Java

I was asked this question in an interview at a well-known IT company. They asked how I would implement a character encoding if there were so many characters that the 16 bits of Unicode were not enough. I answered that we could use a 64-bit encoding for characters. They said even that might not be enough, so I suggested implementing an encoding with Java's BigInteger.

They then said the encoding should use only the bits that are actually needed. For example, the ASCII representation of A is 01000001; we should not store the leading 0, because we don't need it and it wastes memory. I could not answer that. Could you please explain how to approach this problem and how it is handled in practice?

See the Unicode Standard, Chapter 3: "The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. The size of the code unit is specified for each encoding form. This section presents the formal definition of each of these encoding forms."
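To make the difference between the three encoding forms concrete, here is a minimal Java sketch that counts how many bytes one code point occupies in each form. The class name EncodingSizes and the helper encodedLength are made up for illustration, and the sketch assumes the JDK in use provides the "UTF-32BE" charset (common JDKs do, but only the charsets in StandardCharsets are guaranteed).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    // Hypothetical helper: number of bytes needed to encode a single
    // code point in the given charset.
    static int encodedLength(int codePoint, Charset charset) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Assumption: "UTF-32BE" is available on this JDK.
        Charset utf32 = Charset.forName("UTF-32BE");
        // A, e with acute accent, euro sign, and an emoji outside the BMP.
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X  UTF-8: %d  UTF-16: %d  UTF-32: %d%n",
                    cp,
                    encodedLength(cp, StandardCharsets.UTF_8),
                    // The BE variant is used so that no byte order mark is counted.
                    encodedLength(cp, StandardCharsets.UTF_16BE),
                    encodedLength(cp, utf32));
        }
    }
}
```

UTF-8 spends one byte on A and up to four bytes on rarer characters, which is the usual practical answer to "use only as much space as needed": the saving is per character and in whole bytes, not in individual leading bits.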

As for saving bits, this matters only when the text is very large, in which case I would suggest using compression, such as zip. Libraries in various languages let you read from and write to a compressed file directly.
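In Java, one way to do this is with the java.util.zip stream classes. The sketch below uses gzip rather than the zip archive format, and writes to an in-memory buffer so that it is self-contained; the class name CompressedText and its two methods are hypothetical. Replacing the byte-array streams with FileOutputStream and FileInputStream would read and write a compressed file directly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedText {
    // Encodes the text as UTF-8 and gzip-compresses it in one pass.
    static byte[] compress(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(
                new GZIPOutputStream(buffer), StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed here, so the gzip trailer has been written.
        return buffer.toByteArray();
    }

    // Decompresses gzip data and decodes it as UTF-8 text.
    static String decompress(byte[] data) {
        try (Reader reader = new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(data)),
                StandardCharsets.UTF_8)) {
            StringBuilder text = new StringBuilder();
            char[] chunk = new char[4096];
            int read;
            while ((read = reader.read(chunk)) != -1) {
                text.append(chunk, 0, read);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Highly repetitive sample text, chosen so the compression is visible.
        StringBuilder sample = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sample.append("character encoding ");
        }
        byte[] packed = compress(sample.toString());
        System.out.println("original chars: " + sample.length());
        System.out.println("compressed bytes: " + packed.length);
        System.out.println("round trip ok: "
                + sample.toString().equals(decompress(packed)));
    }
}
```

For very short strings the gzip header and trailer can make the output larger than the input, which is why this approach pays off only for large text.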
